import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Read in the data¶
songs_url = 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv'
df = pd.read_csv( songs_url )
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32833 entries, 0 to 32832
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   track_id                  32833 non-null  object 
 1   track_name                32828 non-null  object 
 2   track_artist              32828 non-null  object 
 3   track_popularity          32833 non-null  int64  
 4   track_album_id            32833 non-null  object 
 5   track_album_name          32828 non-null  object 
 6   track_album_release_date  32833 non-null  object 
 7   playlist_name             32833 non-null  object 
 8   playlist_id               32833 non-null  object 
 9   playlist_genre            32833 non-null  object 
 10  playlist_subgenre         32833 non-null  object 
 11  danceability              32833 non-null  float64
 12  energy                    32833 non-null  float64
 13  key                       32833 non-null  int64  
 14  loudness                  32833 non-null  float64
 15  mode                      32833 non-null  int64  
 16  speechiness               32833 non-null  float64
 17  acousticness              32833 non-null  float64
 18  instrumentalness          32833 non-null  float64
 19  liveness                  32833 non-null  float64
 20  valence                   32833 non-null  float64
 21  tempo                     32833 non-null  float64
 22  duration_ms               32833 non-null  int64  
dtypes: float64(9), int64(4), object(10)
memory usage: 5.8+ MB
df.shape
(32833, 23)
The df DataFrame has 32,833 rows and 23 columns.
Display the variable names and their associated data types¶
df.dtypes
track_id                     object
track_name                   object
track_artist                 object
track_popularity              int64
track_album_id               object
track_album_name             object
track_album_release_date     object
playlist_name                object
playlist_id                  object
playlist_genre               object
playlist_subgenre            object
danceability                float64
energy                      float64
key                           int64
loudness                    float64
mode                          int64
speechiness                 float64
acousticness                float64
instrumentalness            float64
liveness                    float64
valence                     float64
tempo                       float64
duration_ms                   int64
dtype: object
Display the number of missing values for each variable¶
df.isna().sum()
track_id                    0
track_name                  5
track_artist                5
track_popularity            0
track_album_id              0
track_album_name            5
track_album_release_date    0
playlist_name               0
playlist_id                 0
playlist_genre              0
playlist_subgenre           0
danceability                0
energy                      0
key                         0
loudness                    0
mode                        0
speechiness                 0
acousticness                0
instrumentalness            0
liveness                    0
valence                     0
tempo                       0
duration_ms                 0
dtype: int64
There are three variables that have missing values: track_name, track_artist and track_album_name. Each of these variables is missing 5 values.
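As a quick sanity check, the affected rows can be inspected directly. Below is a minimal sketch on toy data (hypothetical rows, not the real Spotify data), since whether the three name columns are missing in the same rows is worth confirming:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the missing-value pattern (hypothetical rows,
# not the real data).
toy = pd.DataFrame({
    'track_id': ['t1', 't2', 't3'],
    'track_name': ['Song A', np.nan, 'Song C'],
    'track_artist': ['Artist A', np.nan, 'Artist C'],
    'track_album_name': ['Album A', np.nan, 'Album C'],
})

# Select rows where any of the three name columns is missing.
name_cols = ['track_name', 'track_artist', 'track_album_name']
missing = toy[toy[name_cols].isna().any(axis=1)]
print(missing)
```

On the real data, substituting df for toy would show whether the 5 missing values coincide in the same rows.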
Display the number of unique values for each variable¶
df.nunique()
track_id                    28356
track_name                  23449
track_artist                10692
track_popularity              101
track_album_id              22545
track_album_name            19743
track_album_release_date     4530
playlist_name                 449
playlist_id                   471
playlist_genre                  6
playlist_subgenre              24
danceability                  822
energy                        952
key                            12
loudness                    10222
mode                            2
speechiness                  1270
acousticness                 3731
instrumentalness             4729
liveness                     1624
valence                      1362
tempo                       17684
duration_ms                 19785
dtype: int64
Cleaning the dataset¶
According to the Tidy Tuesday GitHub page, track_id is a unique song ID. However, there are fewer unique track_id values than rows in the dataset.
df.track_id.value_counts().value_counts()
count
1     25190
2      2384
3       510
4       142
5        60
6        35
7        17
8        15
9         2
10        1
Name: count, dtype: int64
We can see that some track ids have from 2 to 10 rows associated with them. Further exploration (not shown in this notebook for brevity) reveals that each row represents a track from an album within a specific playlist - meaning duplicates arise from the same track appearing in multiple playlists.
Now I will check whether the duplication of the track_id is associated with changes in the output, track_popularity, and inputs of interest. We group the data by track_id and count the number of unique values for each feature of interest - including both the target track_popularity and the input features we might use in our model.
df.groupby(['track_id']).\
aggregate(num_track_pop_values = ('track_popularity', 'nunique'),
num_playlist_genre_values = ('playlist_genre', 'nunique'),
num_playlist_subgenre_values = ('playlist_subgenre', 'nunique'),
num_danceability_values = ('danceability', 'nunique'),
num_energy_values = ('energy', 'nunique'),
num_key_values = ('key', 'nunique'),
num_loudness_values = ('loudness', 'nunique'),
num_mode_values = ('mode', 'nunique'),
num_speechiness_values = ('speechiness', 'nunique'),
num_acousticness_values = ('acousticness', 'nunique'),
num_instrumentalness_values = ('instrumentalness', 'nunique'),
num_liveness_values = ('liveness', 'nunique'),
num_valence_values = ('valence', 'nunique'),
num_tempo_values = ('tempo', 'nunique'),
num_duration_values = ('duration_ms', 'nunique')).\
reset_index().\
nunique()
track_id                        28356
num_track_pop_values                1
num_playlist_genre_values           5
num_playlist_subgenre_values       10
num_danceability_values             1
num_energy_values                   1
num_key_values                      1
num_loudness_values                 1
num_mode_values                     1
num_speechiness_values              1
num_acousticness_values             1
num_instrumentalness_values         1
num_liveness_values                 1
num_valence_values                  1
num_tempo_values                    1
num_duration_values                 1
dtype: int64
We observe that some tracks appear with up to 5 different playlist_genre values and up to 10 different playlist_subgenre values. This is expected, since the same track can be included in multiple playlists with different themes or categorizations.
To ensure each row in the dataset represents a unique track, I filtered the original dataset to keep only tracks that appear exactly once, creating a new dataset df_clean.
track_counts = df.track_id.value_counts()
unique_track_ids = track_counts[track_counts == 1].index
df_clean = df[df.track_id.isin(unique_track_ids)]
df_clean.shape
(25190, 23)
df.shape
(32833, 23)
(32833 - 25190) / 32833
0.23278408917857035
I chose to remove all duplicate tracks, which reduced the dataset size by about 23%. The main benefit was making the data much simpler to work with, as I didn't have to figure out how to handle cases where the same song appeared in conflicting playlists (for example, a song listed in both a pop and a rock playlist). The primary cost, however, is the reduction in data size, which could slightly decrease the model's statistical power. I determined that the value of a simpler, more interpretable dataset was worth the trade-off of a smaller sample size.
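For completeness, an alternative I did not pursue: since the audio features and popularity are constant within a track_id, one could instead keep one row per track with drop_duplicates, preserving the full sample size at the cost of arbitrarily picking one playlist assignment. A minimal sketch on toy data:

```python
import pandas as pd

# Toy illustration (hypothetical rows): the same track appears in two
# playlists; keep only the first occurrence per track_id.
toy = pd.DataFrame({
    'track_id':         ['t1', 't1', 't2'],
    'playlist_id':      ['p1', 'p2', 'p1'],
    'track_popularity': [60, 60, 40],
})

deduped = toy.drop_duplicates(subset='track_id', keep='first')
print(deduped)
```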
I'm treating the variable mode as a non-numeric column because it has a low number of unique values (2) and because those numbers represent categories, not quantities. I will treat the column key as non-numeric for the same reasons.
df_copy=df_clean.copy()
df_copy['key'] = df_copy['key'].astype('object')
df_copy['mode'] = df_copy['mode'].astype('object')
Here I transformed the original target variable track_popularity into an object-type variable binary_outcome, which has 2 unique values: 1 if the song's popularity is over 50, and 0 otherwise.
df_copy['binary_outcome'] = np.where( df_copy.track_popularity > 50, 1, 0 )
df_copy['binary_outcome'] = df_copy['binary_outcome'].astype('object')
b. Visualization¶
1. Counts of categorical variables¶
sns.catplot(data=df_copy, x='playlist_genre', kind='count', hue='playlist_genre', aspect=1)
plt.show()
The bar chart above shows the distribution of songs across six different playlist genres in the cleaned dataset. While the genres are relatively balanced, rap stands out as the most represented genre.
sns.catplot(data=df_copy, x='mode', kind='count', hue='mode')
plt.show()
While the distribution of songs across the two modes (1 - major, 0 - minor) is also relatively balanced, mode 1 is slightly more prevalent.
sns.catplot(data=df_copy, x='key', kind='count', hue='key')
plt.show()
The distribution of songs across the 12 musical keys is relatively balanced, with the exception of key 3, which has the fewest songs. Keys 0, 1, and 7 are the most common in the cleaned dataset.
sns.catplot(data=df_copy, x='binary_outcome', kind='count', hue='binary_outcome')
plt.show()
This bar chart shows that our cleaned dataset contains roughly twice as many unpopular songs (track_popularity of 50 or below) as popular ones.
df_copy.binary_outcome.value_counts(normalize=True)
binary_outcome
0    0.672092
1    0.327908
Name: proportion, dtype: float64
More than 67% of the songs in the dataset are classified as unpopular under the popularity threshold of 50!
2. Distributions of continuous variables¶
To visualize marginal distributions of all continuous variables at once, I will first transform the dataset into the long format, which Seaborn prefers, and call it df_lf.
df_features = df_copy.select_dtypes('number').copy()
df_objects = df_copy.select_dtypes('object').copy()
id_cols = ['rowid'] + df_objects.columns.to_list()
df_lf = df_copy.reset_index().\
rename(columns={'index': 'rowid'}).\
melt(id_vars=id_cols, value_vars=df_features.columns)
I will now create density plots for all continuous variables.
sns.displot(data=df_lf, x='value', col='variable', kind='kde', col_wrap = 3,
facet_kws={'sharex':False, 'sharey':False},
common_norm=False)
plt.show()
None of the distributions appear perfectly symmetric or unimodal - several exhibit multiple peaks, while others have long tails, indicating skewness. This suggests that the continuous variables deviate from a Gaussian distribution. Among them, the distributions of danceability, valence, energy, and duration are the most Gaussian-like.
3. Relationships between continuous variables¶
The pair plot below shows the relationships between all continuous variables.
sns.pairplot(data=df_copy, x_vars=['track_popularity', 'danceability', 'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms'],
y_vars=['track_popularity', 'danceability', 'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms'],
diag_kind='kde',
diag_kws={'common_norm': False})
plt.show()
There is a positive correlation between loudness and energy. Aside from this, the other scatterplots do not show any obvious linear or non-linear relationships, with the points generally appearing as unstructured clouds.
fig, ax = plt.subplots()
sns.heatmap(data = df_copy.loc[:, ['track_popularity', 'danceability', 'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms']].corr(),
vmin=-1, vmax=1, center = 0,
cmap='coolwarm',
annot=True, annot_kws={'size': 7},
ax=ax)
plt.show()
The correlation plot above helps to see the relationships between continuous variables more clearly. It shows a somewhat strong positive correlation between loudness and energy and moderate negative correlation between acousticness and energy. Most other variable pairs show weak correlations, both positive and negative. The target variable, track_popularity, is only weakly correlated with other features - showing a slight positive correlation with acousticness and slight negative correlations with energy, instrumentalness, and duration_ms.
4. Summaries of the continuous variables grouped by categorical variables¶
First, I will study distributions of all continuous variables grouped by playlist_genre.
sns.displot(data=df_lf, x='value', col='variable', col_wrap=3, kind='kde',
hue='playlist_genre',
facet_kws={'sharex': False, 'sharey': False},
common_norm=False,
palette='bright')
plt.show()
The distributions of continuous variables such as danceability, energy, speechiness, acousticness, instrumentalness, valence, tempo, and duration_ms vary substantially across different playlist genres.
The plot below shows distributions of all continuous variables grouped by key.
sns.displot(data=df_lf, x='value', col='variable', col_wrap=3, kind='kde',
hue='key',
facet_kws={'sharex': False, 'sharey': False},
common_norm=False,
palette='bright')
plt.show()
While there are some subtle shifts, the distributions of the continuous variables are largely consistent across the different musical keys. This suggests that musical key is not a primary factor influencing these audio features.
The plot below shows distributions of all continuous variables grouped by mode.
sns.displot(data=df_lf, x='value', col='variable', col_wrap=3, kind='kde',
hue='mode',
facet_kws={'sharex': False, 'sharey': False},
common_norm=False)
plt.show()
The distributions of continuous variables appear largely similar across the two categories of mode. This suggests that mode may not have a strong influence on the distributions of these continuous features.
I will now create a series of point plots to explore how the means of all continuous variables vary across 3 categorical variables (playlist_genre, mode, and key). Each facet represents a different continuous variable, the x-axis shows the categories of a selected categorical variable, and y-axis shows values of the continuous variable within each facet.
sns.catplot(data=df_lf, x='playlist_genre', y='value', col='variable', col_wrap=3, kind='point', hue='playlist_genre',
            sharey=False, linestyle='none')
plt.show()
The point plots above suggest meaningful differences in the mean values of continuous variables across some playlist genres. In particular, the facet showing the target variable track_popularity reveals that songs in pop, rap, rock, and Latin playlists tend to have higher average popularity, with overlapping confidence intervals. In contrast, songs in edm playlists have a significantly lower mean popularity, with a non-overlapping confidence interval - indicating a likely true difference in average popularity compared to the other genres.
sns.catplot(data=df_lf, x='key', y='value', col='variable', col_wrap=3, kind='point', hue='key',
            sharey=False, linestyle='none')
plt.show()
The point plots above suggest that there are meaningful differences in the means of some continuous variables grouped by key. Focusing on the facet for the target variable track_popularity, the plot suggests meaningful differences in average popularity across keys. For example, songs in key 8 appear to have higher average popularity than songs in several other keys (such as 2, 6, and 7; note that the keys range from 0 to 11).
sns.catplot(data=df_lf, x='mode', y='value', col='variable', col_wrap=3, kind='point', hue='mode',
            sharey=False, linestyle='none')
plt.show()
The point plot above shows meaningful differences in the average values of track popularity, danceability, loudness, speechiness, and tempo depending on the mode of the song. Focusing on the facet for track_popularity, songs in mode 1 have higher average popularity than songs in mode 0.
sns.catplot(data=df_lf, x='binary_outcome', y='value', col='variable', col_wrap=3, kind='point', hue='binary_outcome',
            sharey=False, linestyle='none')
plt.show()
The point plot reveals notable differences in the mean values of several audio features between popular and unpopular songs. Specifically, popular songs tend to have higher danceability, loudness, and acousticness, while showing lower average valence, energy, instrumentalness, liveness, and duration_ms.
5. Histograms and relationships between continuous inputs broken up by the outcome unique values¶
The plot below shows the distributions of all continuous variables, grouped by binary_outcome, along the diagonal, as well as relationships between input continuous variables in the off-diagonal plots, also grouped by binary_outcome.
sns.pairplot(data=df_copy, x_vars=['danceability', 'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms'],
y_vars=['danceability', 'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms'],
hue='binary_outcome',
diag_kind='kde',
diag_kws={'common_norm': False})
plt.show()
Grouping the relationships between continuous variables by binary_outcome did not reveal any clear patterns or separation between the two classes. Across all scatterplots, the data points for both outcomes appear largely overlapping, suggesting little visual distinction between them.
The histograms below show the distribution of each continuous input variable separately for the two categories of the binary_outcome.
sns.displot(data=df_copy, x='danceability', col='binary_outcome', kind='hist')
plt.show()
sns.displot(data=df_copy, x='energy', col='binary_outcome', kind='hist')
plt.show()
sns.displot(data=df_copy, x='loudness', col='binary_outcome', kind='hist')
plt.show()
sns.displot(data=df_copy, x='speechiness', col='binary_outcome', kind='hist')
plt.show()
sns.displot(data=df_copy, x='acousticness', col='binary_outcome', kind='hist')
plt.show()
sns.displot(data=df_copy, x='instrumentalness', col='binary_outcome', kind='hist', bins=50)
plt.show()
sns.displot(data=df_copy, x='liveness', col='binary_outcome', kind='hist')
plt.show()
sns.displot(data=df_copy, x='valence', col='binary_outcome', kind='hist')
plt.show()
sns.displot(data=df_copy, x='tempo', col='binary_outcome', kind='hist')
plt.show()
sns.displot(data=df_copy, x='duration_ms', col='binary_outcome', kind='hist')
plt.show()
The shapes of the distributions appear similar across the two outcome groups, suggesting that none of these continuous variables are strongly predictive of the outcome on their own. The apparent difference in the height of the histograms is due to the fact that category 0 has approximately twice as many observations as category 1. Since the default histograms display raw counts, the plot for category 0 naturally appears taller, even though the underlying distribution shapes are comparable.
6. Count the number of observations for each combination of outcome unique value and the categorical input unique values¶
The dodged bar chart below shows how the 6 playlist genres are distributed across the 2 outcomes. Most popular songs in the dataset fall in the rap category, while edm has the largest number of unpopular songs and the smallest number of popular ones.
sns.catplot(data=df_copy, x='playlist_genre', hue='binary_outcome', kind='count', aspect=1)
plt.show()
The dodged bar chart below shows how the 12 musical keys are distributed across the 2 outcomes. Key 1 has both the highest count of popular songs and the highest count of unpopular songs.
sns.catplot(data=df_copy, x='key', hue='binary_outcome', kind='count', aspect=1)
plt.show()
The dodged bar chart below shows how the 2 modes are distributed across the 2 outcomes. Mode 1 has the highest counts of both popular and unpopular songs.
sns.catplot(data=df_copy, x='mode', hue='binary_outcome', kind='count', aspect=1)
plt.show()
In summary, the EDA revealed that while playlist genres, modes, and keys are relatively balanced in the cleaned dataset, rap is the most represented genre, mode 1 is slightly more common, and keys 0, 1, and 7 dominate, with key 3 being least frequent. The dataset contains about twice as many unpopular as popular songs. Continuous variables generally deviate from a Gaussian distribution, though danceability, valence, energy, and duration_ms are closest to normality; loudness and energy show a strong positive correlation, while acousticness and energy are moderately negatively correlated. Differences in audio features emerge across genres, keys, and modes - pop, rap, rock, and Latin genres tend to have higher popularity, while edm has the lowest. Songs in certain keys (e.g., key 8) and in mode 1 show higher average popularity. Popular songs tend to have higher danceability, loudness, and acousticness but lower valence, energy, instrumentalness, liveness, and duration_ms. However, scatterplots grouped by outcome show substantial overlap between classes, suggesting that no single continuous variable is strongly predictive of popularity on its own. Finally, no obvious linear or nonlinear relationship was observed between the continuous input variables and track_popularity (see Supporting - EDA for more).
B. Clustering¶
This time, I selected danceability, valence, and speechiness as input variables for clustering. This combination captures diverse aspects of the tracks, making it a strong candidate for identifying meaningful groupings.
df_cluster_vars = df_copy[['danceability', 'speechiness', 'valence']].copy()
df_cluster_vars.info()
<class 'pandas.core.frame.DataFrame'>
Index: 25190 entries, 3 to 32832
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   danceability  25190 non-null  float64
 1   speechiness   25190 non-null  float64
 2   valence       25190 non-null  float64
dtypes: float64(3)
memory usage: 1.3 MB
Preprocessing¶
The distributions of danceability and valence are approximately Gaussian and require no transformation. However, speechiness is right-skewed, so I applied a log transformation to reduce the influence of extreme values. There is no strong correlation among these three variables, which helps ensure that each contributes unique information to the clustering. None of the three has missing values.
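The skewness claim can be checked numerically before transforming. A sketch on synthetic data using pandas' built-in skew (values far from 0 indicate asymmetry; the +0.01 offset guards against taking the log of zero):

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins: a roughly symmetric variable and a right-skewed one.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    'valence':     rng.normal(0.5, 0.1, 2000),
    'speechiness': rng.lognormal(mean=-2.5, sigma=0.8, size=2000),
})

skew_before = toy['speechiness'].skew()

# A log transform with a small offset pulls in the long right tail.
toy['speechiness_log'] = np.log(toy['speechiness'] + 0.01)
skew_after = toy['speechiness_log'].skew()

print(round(skew_before, 2), round(skew_after, 2))
```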
df_cluster_vars['speechiness'] = np.log(df_cluster_vars.speechiness + 0.01)
df_cluster_vars.info()
<class 'pandas.core.frame.DataFrame'>
Index: 25190 entries, 3 to 32832
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   danceability  25190 non-null  float64
 1   speechiness   25190 non-null  float64
 2   valence       25190 non-null  float64
dtypes: float64(3)
memory usage: 1.3 MB
sns.displot(data = df_copy, x='speechiness', kind='hist', kde=True)
plt.show()
sns.displot(data = df_cluster_vars, x='speechiness', kind='hist', kde=True)
plt.show()
Next, I'm going to check the scales of the three variables. The boxplot below shows that the magnitude and scale are dominated by speechiness.
sns.catplot(data=df_cluster_vars, kind='box', aspect=1.5)
plt.show()
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(df_cluster_vars)
sns.catplot( data=pd.DataFrame(X, columns=df_cluster_vars.columns), kind='box', aspect=1.5)
plt.show()
Standardization is complete!
Since the selected features are not highly correlated, I chose to cluster on the original variables rather than apply dimensionality reduction with PCA. This preserves the original interpretability of the variables, which would be lost if transformed into principal components.
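The low-correlation claim can be verified with a quick corr() call; on the real data this would be df_cluster_vars.corr(). A sketch on synthetic, independently drawn stand-ins for the three features:

```python
import numpy as np
import pandas as pd

# Independent draws, so off-diagonal correlations should be near zero.
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    'danceability': rng.normal(0.6, 0.15, 5000),
    'speechiness':  rng.normal(-2.0, 0.7, 5000),
    'valence':      rng.normal(0.5, 0.2, 5000),
})

corr = toy.corr()
# Largest absolute off-diagonal correlation.
off_diag = corr.where(~np.eye(3, dtype=bool)).abs().max().max()
print(corr.round(3))
print('max |r| off-diagonal:', round(off_diag, 3))
```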
As planned at the proposal stage, I used KMeans for clustering. While hierarchical clustering is an alternative, it is typically practical only up to around 15,000 observations, whereas my cleaned dataset contains 25,190 rows. KMeans is therefore more appropriate for data of this size.
from sklearn.cluster import KMeans
df_cluster_vars_copy = df_cluster_vars.copy()
Identifying the optimal number of clusters¶
tots_within = []
K = range(1, 31)
for k in K:
km = KMeans(n_clusters=k, random_state=121, n_init=25, max_iter=500)
km = km.fit( X )
tots_within.append(km.inertia_)
fig, ax = plt.subplots()
ax.plot( K, tots_within, 'bo-' )
ax.set_xlabel('number of clusters')
ax.set_ylabel('total within sum of squares')
plt.show()
The elbow plot shows that the total within-cluster sum of squares begins to level off around 5 clusters. While the curve doesn't become completely flat beyond that point, the rate of decrease slows significantly.
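A complementary check I did not run here is the silhouette score, which rewards tight, well-separated clusters and peaks at a good choice of k. A sketch on synthetic blobs standing in for the standardized matrix X:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs in 3 dimensions.
rng = np.random.default_rng(3)
pts = np.vstack([
    rng.normal(loc=center, scale=0.3, size=(100, 3))
    for center in ([0, 0, 0], [4, 4, 0], [0, 4, 4])
])

# Higher silhouette is better; the score is undefined for k=1.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=121, n_init=10).fit_predict(pts)
    scores[k] = silhouette_score(pts, labels)

best_k = max(scores, key=scores.get)
print({k: round(v, 3) for k, v in scores.items()}, 'best k:', best_k)
```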
Running KMeans for the optimal number of clusters¶
clusters_5 = KMeans(n_clusters=5, random_state=121, n_init=25, max_iter=500).fit_predict( X )
df_cluster_vars_copy['k5'] = pd.Series(clusters_5, index=df_cluster_vars_copy.index).astype('category')
df_cluster_vars_copy.info()
<class 'pandas.core.frame.DataFrame'>
Index: 25190 entries, 3 to 32832
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   danceability  25190 non-null  float64 
 1   speechiness   25190 non-null  float64 
 2   valence       25190 non-null  float64 
 3   k5            25190 non-null  category
dtypes: category(1), float64(3)
memory usage: 1.3 MB
df_cluster_vars_copy.k5.value_counts()
k5
2    7132
0    5961
1    4725
4    4349
3    3023
Name: count, dtype: int64
Visualizing clustering results¶
sns.pairplot(data = df_cluster_vars_copy, hue='k5', diag_kws={'common_norm': False})
plt.show()
Cluster-wise Summary and Distribution of Danceability, Speechiness, and Valence¶
Danceability distribution across 5 clusters¶
sns.catplot(data=df_cluster_vars_copy, x='k5', y='danceability', kind='violin', hue='k5', aspect=1.5)
plt.show()
Cluster 4 includes the most danceable songs, showing the highest median and a relatively tight distribution, though it overlaps somewhat with Clusters 0 and 2, which fall into an intermediate-to-high danceability range. These two intermediate clusters share a similar range, overlapping Cluster 4 at the high end and Cluster 3 at the low end.
Cluster 1 stands out as the least danceable group, with the lowest median and minimal overlap with the more danceable Cluster 3. Cluster 3 bridges the gap between the intermediate and low-danceability groups, overlapping significantly with Clusters 0 and 2, and only slightly with Cluster 1.
While the clusters are not perfectly distinct, the contrast between the high median in Cluster 4 and the low median in Cluster 1 suggests that danceability meaningfully separates the most and least danceable groups. Additionally, the long lower tails seen in Clusters 1, 2, 3, and 4 indicate greater variability and some skewness in danceability values, especially toward the lower end of the scale.
Speechiness distribution across 5 clusters¶
sns.catplot(data=df_cluster_vars_copy, x='k5', y='speechiness', kind='violin', hue='k5', aspect=1.5)
plt.show()
Clusters 0, 1, and 2 share similarly low median values, closely aligned distribution shapes, and fully overlapping violins, forming a clear low-speechiness group. In contrast, Clusters 3 and 4 exhibit higher, comparable medians and substantial overlap, suggesting a high-speechiness group.
Notably, the two groups - Clusters 0, 1, and 2 and Clusters 3 and 4 - show no overlap in their interquartile ranges, indicating a clear separation between low and high speechiness clusters.
Cluster 1 stands out slightly within the low-speechiness group due to its long tails on both ends, indicating greater variability in speechiness values.
Valence distribution across 5 clusters¶
sns.catplot(data=df_cluster_vars_copy, x='k5', y='valence', kind='violin', hue='k5', aspect=1.5)
plt.show()
Clusters 0, 1, and 3 have similar medians and substantially overlapping interquartile ranges, forming a low-valence group - that is, clusters with songs that tend to sound more negative.
In contrast, Clusters 2 and 4 have higher median values and significantly overlapping interquartile ranges, indicating a higher-valence group associated with more positive-sounding songs.
The slight overlap between Clusters 3 and 4 suggests that Cluster 3 shares some valence characteristics with the higher-valence group, despite generally aligning with lower-valence clusters. Additionally, the long upward tails of Clusters 1 and 3 and the long downward tail of Cluster 4 reflect greater variability in valence values at the distribution extremes.
In summary, Clusters 0, 1, and 2 share low speechiness, with similar medians and overlapping distributions, while Clusters 3 and 4 form a distinct high-speechiness group. For danceability, Cluster 4 is the most danceable, Cluster 1 the least, and the others fall in between with overlapping ranges. Valence separates into a low group (Clusters 0, 1, and 3) and a high group (Clusters 2 and 4), with slight overlap between Clusters 3 and 4. Cluster 4 consistently stands out with high values across all three features, while Cluster 1 is low on all three and shows the greatest variability.
Comparing the cluster assignments to unique values of categorical inputs¶
df_cluster_vars_copy['playlist_genre'] = df_copy.playlist_genre
df_cluster_vars_copy['playlist_subgenre'] = df_copy.playlist_subgenre
df_cluster_vars_copy['key'] = df_copy.key
df_cluster_vars_copy['mode'] = df_copy['mode']
playlist_genre vs k5¶
fig, ax = plt.subplots()
sns.heatmap(data = pd.crosstab( df_cluster_vars_copy.playlist_genre, df_cluster_vars_copy.k5, margins=True ),
annot=True, annot_kws={"fontsize": 15}, fmt='g',
cbar=True,
ax=ax)
plt.show()
sns.catplot(data = df_cluster_vars_copy, x='playlist_genre', hue='k5', kind='count')
plt.show()
The heatmap and the dodged bar chart above show that while all clusters include a mix of genres, certain clusters have stronger associations with specific genres. For example, Cluster 1 is heavily represented by rock tracks, and Cluster 4 has the highest concentration of rap songs. Cluster 2, the largest group, appears relatively balanced across genres, while Cluster 0 leans more toward EDM and pop. Cluster 3 has a moderate concentration of rap and r&b. Overall, genre does not strictly define clusters, but some genre-based patterns are noticeable.
key vs k5¶
fig, ax = plt.subplots()
sns.heatmap(data = pd.crosstab( df_cluster_vars_copy.key, df_cluster_vars_copy.k5, margins=True ),
annot=True, annot_kws={"fontsize": 13}, fmt='g',
cbar=True,
ax=ax)
plt.show()
sns.catplot(data = df_cluster_vars_copy, x='key', hue='k5', kind='count')
plt.show()
Similarly to the previous plots, the heatmap and the dodged bar chart above show that all clusters contain songs across a variety of musical keys. While some keys appear slightly more frequently, there is no strong or exclusive association between any particular key and any cluster. This suggests that musical key is not a major factor driving the clustering structure.
mode vs k5¶
fig, ax = plt.subplots()
sns.heatmap(data = pd.crosstab( df_cluster_vars_copy['mode'], df_cluster_vars_copy.k5, margins=True ),
annot=True, annot_kws={"fontsize": 13}, fmt='g',
cbar=True,
ax=ax)
plt.show()
sns.catplot(data = df_cluster_vars_copy, x='mode', hue='k5', kind='count')
plt.show()
Similarly to the other visualizations, the heatmap and the dodged bar chart above show that all clusters contain a mix of both major and minor songs. Clusters 3 and 4 have a relatively balanced distribution of modes, while Clusters 0, 1, and 2 show a slight dominance of major songs (mode = 1). This suggests that modality is not a key factor influencing the clustering structure.
Comparing the cluster assignments to unique values of the outcome¶
df_cluster_vars_copy['binary_outcome'] = df_copy.binary_outcome
fig, ax = plt.subplots()
sns.heatmap(data = pd.crosstab( df_cluster_vars_copy.binary_outcome, df_cluster_vars_copy.k5, margins=True ),
annot=True, annot_kws={"fontsize": 15}, fmt='g',
cbar=True,
ax=ax)
plt.show()
sns.catplot(data = df_cluster_vars_copy, x='binary_outcome', hue='k5', kind='count')
plt.show()
The heatmap and the dodged bar chart above show that all clusters contain a roughly similar proportion of songs with popularity above and below 50. This suggests that the binary popularity outcome is not a major factor driving the clustering structure.
In summary, the clustering grouped songs into distinct profiles based on their danceability, valence, and speechiness. The key finding is that these clusters did not separate popular from unpopular songs, which reinforces our EDA's conclusion that no single audio feature is strongly predictive on its own. The analysis also revealed noticeable genre-based patterns within the clusters, supporting the EDA's finding that audio features differ across genres and suggesting this relationship is important to explore in the modeling stage.
In Section B (EDA), we saw that the distributions of the continuous input variables loudness, speechiness, acousticness, instrumentalness, and liveness are not Gaussian-like. To address this, I applied a natural log transformation (with a small 0.01 offset to handle zeros) to speechiness, acousticness, instrumentalness, and liveness. Because loudness includes negative values, neither the natural log nor the square root transformation is applicable. Instead, I applied a cube root transformation to loudness, which works well with negative values and helps reduce skew.
Although the distribution of tempo shows multiple peaks and is not perfectly symmetric, it is not strongly skewed. Therefore, I used it in its original form. The remaining continuous input variables were already approximately Gaussian-like and did not require transformation.
numeric_vars = ['danceability', 'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms']
df_numeric_inputs = df_copy[numeric_vars].copy()
df_numeric_inputs.describe()
| danceability | energy | loudness | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 25190.000000 | 25190.000000 | 25190.000000 | 25190.000000 | 25190.000000 | 25190.000000 | 25190.000000 | 25190.000000 | 25190.000000 | 25190.000000 |
| mean | 0.652559 | 0.697178 | -6.896591 | 0.108713 | 0.179709 | 0.095998 | 0.191432 | 0.510365 | 120.981311 | 226843.724732 |
| std | 0.146274 | 0.185624 | 3.068494 | 0.103691 | 0.225940 | 0.238703 | 0.157258 | 0.235183 | 26.994157 | 61983.810649 |
| min | 0.000000 | 0.000175 | -46.448000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 4000.000000 |
| 25% | 0.560000 | 0.577000 | -8.412000 | 0.041000 | 0.014100 | 0.000000 | 0.092500 | 0.328000 | 99.976000 | 187539.500000 |
| 50% | 0.669000 | 0.721000 | -6.345500 | 0.062700 | 0.080200 | 0.000025 | 0.127000 | 0.512000 | 121.994000 | 217200.000000 |
| 75% | 0.760000 | 0.844000 | -4.764000 | 0.135000 | 0.265000 | 0.008108 | 0.249000 | 0.695000 | 134.057000 | 255413.000000 |
| max | 0.983000 | 1.000000 | 1.275000 | 0.918000 | 0.994000 | 0.994000 | 0.996000 | 0.991000 | 239.440000 | 517810.000000 |
df_numeric_inputs['loudness'] = np.cbrt(df_numeric_inputs.loudness)
df_numeric_inputs['speechiness'] = np.log(df_numeric_inputs.speechiness + 0.01)
df_numeric_inputs['acousticness'] = np.log(df_numeric_inputs.acousticness + 0.01)
df_numeric_inputs['instrumentalness'] = np.log(df_numeric_inputs.instrumentalness + 0.01)
df_numeric_inputs['liveness'] = np.log(df_numeric_inputs.liveness + 0.01)
df_numeric_inputs.info()
<class 'pandas.core.frame.DataFrame'> Index: 25190 entries, 3 to 32832 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 danceability 25190 non-null float64 1 energy 25190 non-null float64 2 loudness 25190 non-null float64 3 speechiness 25190 non-null float64 4 acousticness 25190 non-null float64 5 instrumentalness 25190 non-null float64 6 liveness 25190 non-null float64 7 valence 25190 non-null float64 8 tempo 25190 non-null float64 9 duration_ms 25190 non-null int64 dtypes: float64(9), int64(1) memory usage: 2.6 MB
sns.catplot( data=df_numeric_inputs, kind='box', aspect=2)
plt.show()
The boxplot above shows that duration_ms dominates the other variables in magnitude and scale, so the inputs must be standardized before modeling.
df_numeric_scaled = pd.DataFrame(StandardScaler().fit_transform(df_numeric_inputs),
columns=df_numeric_inputs.columns,
index=df_numeric_inputs.index)
sns.catplot(data=df_numeric_scaled, kind='box', aspect=2)
plt.show()
Standardization is complete!
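One way to confirm the standardization is to check that every column now has mean ≈ 0 and standard deviation ≈ 1. A quick sketch with two made-up stand-in columns on very different scales (mimicking danceability vs duration_ms):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Stand-in for df_numeric_inputs: columns on wildly different scales
rng = np.random.default_rng(0)
raw = pd.DataFrame({'danceability': rng.uniform(0, 1, 200),
                    'duration_ms': rng.uniform(4e3, 5e5, 200)})

scaled = pd.DataFrame(StandardScaler().fit_transform(raw),
                      columns=raw.columns, index=raw.index)

# StandardScaler uses the population (ddof=0) standard deviation,
# so each column should now have mean ~0 and std ~1
means = scaled.mean()
stds = scaled.std(ddof=0)
```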
Creating a dataset with preprocessed continuous input variables, categorical input variables, and binary_outcome¶
df_modeling = df_numeric_scaled.copy()
df_modeling['playlist_genre'] = df_copy.playlist_genre
df_modeling['key'] = df_copy.key
df_modeling['mode'] = df_copy['mode']
df_modeling['binary_outcome'] = df_copy.binary_outcome.astype(int)
df_modeling.info()
<class 'pandas.core.frame.DataFrame'> Index: 25190 entries, 3 to 32832 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 danceability 25190 non-null float64 1 energy 25190 non-null float64 2 loudness 25190 non-null float64 3 speechiness 25190 non-null float64 4 acousticness 25190 non-null float64 5 instrumentalness 25190 non-null float64 6 liveness 25190 non-null float64 7 valence 25190 non-null float64 8 tempo 25190 non-null float64 9 duration_ms 25190 non-null float64 10 playlist_genre 25190 non-null object 11 key 25190 non-null object 12 mode 25190 non-null object 13 binary_outcome 25190 non-null int32 dtypes: float64(10), int32(1), object(3) memory usage: 3.3+ MB
Defining a list of formulas for eight models¶
I will fit and evaluate eight distinct logistic regression models to predict song popularity. This set includes six standard models required by the project instructions, progressing from a simple baseline to complex interaction models.
In addition, I will test two custom models. The first (Model 7) is directly motivated by my EDA, which revealed that the distributions of key audio features like danceability and energy vary significantly across playlist genres. This model tests the hypothesis that the "recipe" for a popular song is different for each genre.
The second (Model 8) is an exploratory model developed after finding no clear linear or U-shaped patterns in my initial scatter plots. This model tests a different non-linear hypothesis: that the relationship between a song's popularity and its tempo is not linear, but cyclical. This theory explores the idea that certain BPM ranges, like those ideal for activities such as dancing or driving, could be more popular than others.
Thus, my list includes the following models:
- Model 1: intercept-only or constant average model
- Model 2: categorical inputs with additive features
- Model 3: continuous inputs with linear additive features
- Model 4: all inputs (continuous and categorical) with linear additive features
- Model 5: continuous inputs with linear main effect and pair-wise interactions
- Model 6: interaction between categorical and continuous inputs (including main effects)
- Model 7: main effects plus the interaction between playlist_genre and a targeted subset of continuous features
- Model 8: cyclical (sine) tempo feature interacting with playlist_genre, plus main effects of other key audio features
formula_list = ['binary_outcome ~ 1', # intercept-only model
'binary_outcome ~ playlist_genre + key + mode', # categorical inputs with additive features
'binary_outcome ~ danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms', # continuous inputs with linear additive features
'binary_outcome ~ danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms + playlist_genre + key + mode', # all inputs (continuous and categorical) with linear additive features
'binary_outcome ~ (danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms)**2', # continuous inputs with linear main effect and pair-wise interactions
'binary_outcome ~ (playlist_genre + key + mode) * (danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms)', # interaction between categorical and continuous inputs (including main effects)
'binary_outcome ~ playlist_genre * (danceability + energy + speechiness + instrumentalness + valence + tempo)', # interaction between playlist_genre and a targeted subset of continuous features
'binary_outcome ~ playlist_genre * np.sin(tempo) + danceability + energy + acousticness'] # cyclical (sine) tempo feature interacting with playlist_genre, plus main effects of other key audio features
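Model 8's np.sin(tempo) term works because statsmodels formulas (via patsy) evaluate numpy functions when building the design matrix. A minimal sketch with made-up standardized tempo values:

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

# Tiny stand-in frame; tempo here is standardized, as in df_modeling
demo = pd.DataFrame({'tempo': [-1.2, 0.0, 0.7]})

# patsy calls np.sin on the column while constructing the design matrix,
# producing an 'np.sin(tempo)' column alongside the intercept
dm = dmatrix('np.sin(tempo)', data=demo, return_type='dataframe')
```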
Fitting the models, checking their coefficients, and showing the performance on the training set¶
import statsmodels.formula.api as smf
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
I will define a single function to streamline the analysis. For any given model, this function will automatically fit the model, analyze its coefficients (number, significance, and magnitude), and calculate all required performance metrics (Accuracy, Sensitivity, Precision, Specificity, FPR, F1 score, and ROC AUC).
def fit_and_analyze_logistic(mod_name, a_formula, train_data, threshold=0.5):
a_mod = smf.logit(formula=a_formula, data=train_data).fit()
params = a_mod.params
pvalues = a_mod.pvalues
significant_params = params[pvalues < 0.05]
    top_2_features = []
    if len(significant_params) > 0:
        # rank significant coefficients by squared magnitude; keep three so the
        # top two *features* remain even when the intercept ranks first
        top_2_features = (significant_params ** 2).sort_values(ascending=False).head(3).index.tolist()
train_copy = train_data.copy()
train_copy['pred_probability'] = a_mod.predict( train_data )
train_copy['pred_class'] = np.where( train_copy.pred_probability > threshold, 1, 0 )
TN, FP, FN, TP = confusion_matrix( train_copy.binary_outcome.to_numpy(), train_copy.pred_class.to_numpy() ).ravel()
Accuracy = (TN + TP) / (TN + FP + FN + TP)
Sensitivity = (TP) / (TP + FN)
Precision = (TP) / (TP + FP) if (TP + FP) > 0 else 0
Specificity = (TN) / (TN + FP)
FPR = 1 - Specificity
F1_Score = 2 * (Precision * Sensitivity) / (Precision + Sensitivity) if (Precision + Sensitivity) > 0 else 0
ROC_AUC = roc_auc_score( train_copy.binary_outcome.to_numpy(), train_copy.pred_probability.to_numpy() )
res_dict = {'model_name': mod_name,
'model_formula': a_formula,
'num_coefs': len(a_mod.params),
'num_sign_coefs': len(significant_params),
'sign_coefs_&_their_values': [significant_params.to_dict()],
'top_2_features': [top_2_features],
'Accuracy': Accuracy,
'Sensitivity': Sensitivity,
'Precision': Precision,
'Specificity': Specificity,
'FPR': FPR,
'F1 Score': F1_Score,
'ROC_AUC': ROC_AUC}
return pd.DataFrame(res_dict, index=[0])
Model 1. Intercept-only or constant average model¶
formula_list[0]
'binary_outcome ~ 1'
fit_glm_1 = smf.logit(formula=formula_list[0], data=df_modeling).fit()
Optimization terminated successfully.
Current function value: 0.632687
Iterations 4
fit_glm_1.params
Intercept -0.717663 dtype: float64
df_modeling_copy = df_modeling.copy()
df_modeling_copy['pred_probability_M1'] = fit_glm_1.predict(df_modeling)
df_modeling_copy['pred_class_M1'] = np.where(df_modeling_copy.pred_probability_M1 > 0.5, 1, 0)
fig, ax = plt.subplots()
sns.heatmap(pd.crosstab(df_modeling_copy.binary_outcome, df_modeling_copy.pred_class_M1, margins=True),
annot=True, annot_kws={'size': 20}, fmt='3d')
plt.show()
As we can see from the confusion matrix above, the model doesn't classify any songs as "popular".
model_1_results = fit_and_analyze_logistic(mod_name='Model 1', a_formula=formula_list[0], train_data=df_modeling, threshold=0.5)
Optimization terminated successfully.
Current function value: 0.632687
Iterations 4
model_1_results
| model_name | model_formula | num_coefs | num_sign_coefs | sign_coefs_&_their_values | top_2_features | Accuracy | Sensitivity | Precision | Specificity | FPR | F1 Score | ROC_AUC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Model 1 | binary_outcome ~ 1 | 1 | 1 | {'Intercept': -0.717662608612157} | [Intercept] | 0.672092 | 0.0 | 0 | 1.0 | 0.0 | 0 | 0.5 |
df_modeling_copy.pred_probability_M1.value_counts()
pred_probability_M1 0.327908 25190 Name: count, dtype: int64
df_modeling.binary_outcome.value_counts(normalize=True)
binary_outcome 0 0.672092 1 0.327908 Name: proportion, dtype: float64
This model serves as a simple baseline. With no predictor variables, it calculates the average likelihood of a song being popular across the entire dataset.
The model's single coefficient, the intercept (approximately -0.72), is statistically significant. This model predicts the same probability of being popular - approximately 33% - for every song. Because this constant predicted probability of 33% is below the 0.5 decision threshold, the model logically classifies every song as "not popular". This directly explains its performance metrics.
- The accuracy is approximately 0.67, which simply reflects the class imbalance in the dataset; 67% of all songs belong to the "not popular" (majority) class. The model gets them all right by default.
- The sensitivity is 0. Because the model never predicts a song as popular, it fails to correctly identify any of the truly popular songs.
- The precision is also 0. Since the model makes no positive predictions, the number of True Positives is zero, resulting in a precision score of 0.
- The specificity is a perfect 1, and therefore the FPR is 0. This is because the model correctly classifies every truly unpopular song as "not popular".
- The F1 Score is 0, which is expected as it balances precision and sensitivity. A score of 0 indicates a complete failure to identify the positive class.
- The ROC AUC is 0.5, which confirms the model has no ability to distinguish between popular and unpopular songs.
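These baseline metrics follow mechanically from the confusion matrix of an all-negative classifier. A small sketch assuming the 67/33 class split reported above (counts are illustrative, not the actual 25,190 rows):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# An all-negative classifier on a 67%/33% imbalanced outcome
y_true = np.array([0] * 67 + [1] * 33)
y_pred = np.zeros_like(y_true)           # every song predicted "not popular"
const_prob = np.full(len(y_true), 0.33)  # constant predicted probability

TN, FP, FN, TP = confusion_matrix(y_true, y_pred).ravel()
accuracy = (TN + TP) / len(y_true)       # majority-class share
sensitivity = TP / (TP + FN)             # no popular song found
specificity = TN / (TN + FP)             # every unpopular song correct
auc = roc_auc_score(y_true, const_prob)  # constant scores -> no ranking ability
```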
Model 2. Categorical inputs with additive features¶
formula_list[1]
'binary_outcome ~ playlist_genre + key + mode'
fit_glm_2 = smf.logit(formula=formula_list[1], data=df_modeling).fit()
Optimization terminated successfully.
Current function value: 0.615117
Iterations 6
df_modeling_copy['pred_probability_M2'] = fit_glm_2.predict(df_modeling)
df_modeling_copy['pred_class_M2'] = np.where(df_modeling_copy.pred_probability_M2 > 0.5, 1, 0)
fig, ax = plt.subplots()
sns.heatmap(pd.crosstab(df_modeling_copy.binary_outcome, df_modeling_copy.pred_class_M2, margins=True),
annot=True, annot_kws={'size': 20}, fmt='3d')
plt.show()
Similarly to the intercept-only model, Model 2 does not classify any songs as "popular".
model_2_results = fit_and_analyze_logistic(mod_name='Model 2', a_formula=formula_list[1], train_data=df_modeling, threshold=0.5)
Optimization terminated successfully.
Current function value: 0.615117
Iterations 6
model_2_results
| model_name | model_formula | num_coefs | num_sign_coefs | sign_coefs_&_their_values | top_2_features | Accuracy | Sensitivity | Precision | Specificity | FPR | F1 Score | ROC_AUC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Model 2 | binary_outcome ~ playlist_genre + key + mode | 18 | 7 | {'Intercept': -1.633228568540369, 'playlist_ge... | [Intercept, playlist_genre[T.pop], playlist_ge... | 0.672092 | 0.0 | 0 | 1.0 | 0.0 | 0 | 0.599069 |
This model, which includes categorical inputs playlist_genre, key, and mode as predictors, has 18 coefficients, with 7 being statistically significant. This indicates that we can be confident that these 7 features have a non-zero effect on a song's popularity and that we can trust the direction of that relationship as indicated by the coefficients' signs. Below is the full list of all statistically significant features and their values:
model_2_results['sign_coefs_&_their_values'].iloc[0]
{'Intercept': -1.633228568540369,
'playlist_genre[T.latin]': 1.0789336718505969,
'playlist_genre[T.pop]': 1.2322868891076435,
'playlist_genre[T.r&b]': 0.8023634295139332,
'playlist_genre[T.rap]': 1.1026573373088557,
'playlist_genre[T.rock]': 1.2030982403071422,
'key[T.7]': -0.12411197465085606}
model_2_results.top_2_features[0]
['Intercept', 'playlist_genre[T.pop]', 'playlist_genre[T.rock]']
The two features with the most impactful coefficients are the pop (around 1.23) and rock (around 1.20) genres. These are part of a broader pattern where five genres in total - including rap, latin, and r&b - all have a statistically significant positive association with popularity compared to the edm baseline. This provides strong evidence that a song's genre has a non-zero effect on its popularity.
df_modeling_copy.pred_probability_M2.describe()
count 25190.000000 mean 0.327908 std 0.084563 min 0.147124 25% 0.299322 50% 0.361415 75% 0.386307 max 0.432853 Name: pred_probability_M2, dtype: float64
The highest predicted probability for any song is approximately 0.43. Because no prediction crosses the 0.5 decision threshold, this model still classifies all songs as "not popular," just like the intercept-only baseline. Consequently, all performance metrics that depend on the final class predictions - Accuracy, Sensitivity, Precision, and F1 Score - are identical to Model 1. For example, the Sensitivity is 0 because no popular songs are correctly identified.
The key improvement is the ROC AUC, which increased from the baseline of 0.5 to roughly 0.60. This shows the model now has a weak but real ability to distinguish between the classes. Although its predicted probabilities don't cross the 0.5 threshold, it has learned to give slightly higher scores to popular songs than to unpopular ones.
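The fact that ROC AUC can improve while no prediction crosses the threshold follows from AUC being a pure ranking metric. A toy sketch with made-up probabilities:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 1])
# All probabilities sit below the 0.5 threshold, so every class
# prediction is 0 and threshold-based metrics look like the baseline...
probs = np.array([0.15, 0.20, 0.30, 0.40, 0.43])
pred_class = (probs > 0.5).astype(int)

# ...yet both positives are scored above every negative, so the
# ranking is perfect and AUC reaches 1.0
auc = roc_auc_score(y_true, probs)
```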
Model 3. Continuous inputs with linear additive features¶
formula_list[2]
'binary_outcome ~ danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms'
fit_glm_3 = smf.logit(formula=formula_list[2], data=df_modeling).fit()
Optimization terminated successfully.
Current function value: 0.613043
Iterations 5
df_modeling_copy['pred_probability_M3'] = fit_glm_3.predict(df_modeling)
df_modeling_copy['pred_class_M3'] = np.where(df_modeling_copy.pred_probability_M3 > 0.5, 1, 0)
fig, ax = plt.subplots()
sns.heatmap(pd.crosstab(df_modeling_copy.binary_outcome, df_modeling_copy.pred_class_M3, margins=True),
annot=True, annot_kws={'size': 20}, fmt='3d')
plt.show()
Unlike the previous models, Model 3 correctly classifies some songs as popular, though the number of songs it predicts as popular is much smaller than the total number of truly popular songs.
model_3_results = fit_and_analyze_logistic(mod_name='Model 3', a_formula=formula_list[2], train_data=df_modeling, threshold=0.5)
Optimization terminated successfully.
Current function value: 0.613043
Iterations 5
model_3_results
| model_name | model_formula | num_coefs | num_sign_coefs | sign_coefs_&_their_values | top_2_features | Accuracy | Sensitivity | Precision | Specificity | FPR | F1 Score | ROC_AUC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Model 3 | binary_outcome ~ danceability + energy + loudn... | 11 | 10 | {'Intercept': -0.7548636863231273, 'energy': -... | [Intercept, energy, loudness] | 0.670822 | 0.027966 | 0.467611 | 0.984465 | 0.015535 | 0.052776 | 0.617639 |
df_modeling_copy.pred_probability_M3.describe()
count 25190.000000 mean 0.327908 std 0.090986 min 0.054246 25% 0.270384 50% 0.335194 75% 0.390342 max 0.873626 Name: pred_probability_M3, dtype: float64
This model, which includes continuous inputs with linear additive features as predictors, has 11 coefficients, with 10 being statistically significant. This indicates that we can be confident that these 10 audio features have a non-zero effect on a song's popularity and that we can trust the direction of that relationship as indicated by the coefficients' signs. Below is the full list of all statistically significant features and their values:
model_3_results['sign_coefs_&_their_values'].iloc[0]
{'Intercept': -0.7548636863231273,
'energy': -0.3011426478648988,
'loudness': 0.25856875376431365,
'speechiness': -0.0738259110375911,
'acousticness': 0.0976045841097421,
'instrumentalness': -0.23343668704410966,
'liveness': -0.03467033771999454,
'valence': 0.04572820472766655,
'tempo': 0.07046847805346242,
'duration_ms': -0.15488234284077587}
The two coefficients with the highest magnitude apart from the intercept (around -0.76) are energy (around -0.30) and loudness (around 0.26). This indicates that, after controlling for other factors, an increase in a song's loudness is associated with a higher likelihood of it being popular, while a corresponding increase in energy is associated with a lower likelihood. It's important to note that energy and loudness are moderately correlated (0.68). While this does not impact the model's overall predictive power, the individual coefficients for these two features should be interpreted with caution, as the model may have difficulty separating their unique effects.
A key improvement is that the model's predicted probabilities now go as high as 0.87, so unlike the previous models, it classifies some songs as "popular." The model's performance reveals a clear trade-off: its primary strength is very high specificity (around 0.98), which means it is excellent at correctly identifying unpopular songs. This results in a low FPR of just 1.6%, meaning the model incorrectly flags only 1.6% of the truly "unpopular" songs as "popular".
However, this high specificity comes at the cost of extremely low sensitivity (around 0.03), as the model finds only about 3% of all truly popular songs. This poor performance on the positive class is also reflected in the Precision (around 0.47), which shows the model is correct less than half the time it predicts "popular," and the very low F1 Score (around 0.05), which confirms the model is not well balanced.
While the accuracy (around 0.67) is misleading due to class imbalance, the ROC AUC of 0.62 is the most reliable indicator. It confirms this model has a modest but improved ability to distinguish between classes compared to the baseline and Model 2.
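The collinearity caveat above can be checked quantitatively with a pairwise correlation and a variance inflation factor. A sketch with simulated stand-in columns built to mimic the reported 0.68 correlation (in the notebook, the same check could run on df_numeric_scaled):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Simulated zero-mean stand-ins for the standardized inputs; loudness is
# constructed to correlate with energy at roughly 0.68
rng = np.random.default_rng(121)
energy = rng.normal(size=1000)
loudness = 0.68 * energy + np.sqrt(1 - 0.68**2) * rng.normal(size=1000)
tempo = rng.normal(size=1000)
X = pd.DataFrame({'energy': energy, 'loudness': loudness, 'tempo': tempo})

# Pairwise correlation between the two suspect inputs
corr = X.energy.corr(X.loudness)

# VIF > 5 is a common rule of thumb for problematic collinearity;
# a 0.68 correlation implies VIF near 1/(1 - 0.68**2) ~ 1.9
vif = [variance_inflation_factor(X.to_numpy(), i) for i in range(X.shape[1])]
```

A correlation of 0.68 corresponds to a VIF well under 5, supporting the interpretation that the coefficients deserve caution but the collinearity is not severe.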
Model 4. All inputs (continuous and categorical) with linear additive features¶
formula_list[3]
'binary_outcome ~ danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms + playlist_genre + key + mode'
fit_glm_4 = smf.logit(formula=formula_list[3], data=df_modeling).fit()
Optimization terminated successfully.
Current function value: 0.598450
Iterations 6
df_modeling_copy['pred_probability_M4'] = fit_glm_4.predict(df_modeling)
df_modeling_copy['pred_class_M4'] = np.where(df_modeling_copy.pred_probability_M4 > 0.5, 1, 0)
fig, ax = plt.subplots()
sns.heatmap(pd.crosstab(df_modeling_copy.binary_outcome, df_modeling_copy.pred_class_M4, margins=True),
annot=True, annot_kws={'size': 20}, fmt='3d')
plt.show()
Similar to Model 3, Model 4 correctly classifies some songs as popular. However, it still fails to identify a large portion of the truly popular songs.
model_4_results = fit_and_analyze_logistic(mod_name='Model 4', a_formula=formula_list[3], train_data=df_modeling, threshold=0.5)
Optimization terminated successfully.
Current function value: 0.598450
Iterations 6
model_4_results
| model_name | model_formula | num_coefs | num_sign_coefs | sign_coefs_&_their_values | top_2_features | Accuracy | Sensitivity | Precision | Specificity | FPR | F1 Score | ROC_AUC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Model 4 | binary_outcome ~ danceability + energy + loudn... | 28 | 15 | {'Intercept': -1.5917205139340989, 'playlist_g... | [Intercept, playlist_genre[T.rock], playlist_g... | 0.669075 | 0.084988 | 0.474324 | 0.954046 | 0.045954 | 0.144148 | 0.656647 |
df_modeling_copy.pred_probability_M4.describe()
count 25190.000000 mean 0.327908 std 0.119032 min 0.031741 25% 0.241488 50% 0.339460 75% 0.414837 max 0.950219 Name: pred_probability_M4, dtype: float64
This model, which includes all inputs with linear additive features, has 28 coefficients, with 15 being statistically significant. Below is the full list of all statistically significant features and their values:
model_4_results['sign_coefs_&_their_values'].iloc[0]
{'Intercept': -1.5917205139340989,
'playlist_genre[T.latin]': 0.8710920712060213,
'playlist_genre[T.pop]': 1.0828670623658054,
'playlist_genre[T.r&b]': 0.6345725938683296,
'playlist_genre[T.rap]': 0.8857224431014801,
'playlist_genre[T.rock]': 1.4025550042702633,
'danceability': 0.11759980055645763,
'energy': -0.2951494814511126,
'loudness': 0.33056018693286277,
'speechiness': -0.047355152773171635,
'acousticness': 0.10518289624176624,
'instrumentalness': -0.1609632391131655,
'valence': -0.0459921965724886,
'tempo': 0.08061844635555175,
'duration_ms': -0.16591935884430412}
model_4_results.top_2_features[0]
['Intercept', 'playlist_genre[T.rock]', 'playlist_genre[T.pop]']
Among the model's features, the two with the highest magnitude coefficients are both genres: playlist_genre[T.rock] (around 1.40) and playlist_genre[T.pop] (around 1.08). This shows that even after accounting for a song's specific audio features, its genre has a strong association with its popularity. Compared to the edm baseline genre, songs in the rock and pop categories are significantly more likely to be popular, with rock showing the strongest effect.
Some continuous features also show clear effects. For example, higher loudness (around 0.33) is associated with an increased likelihood of being popular, while higher energy (around -0.30) is associated with a decreased likelihood. This helps clarify what a "popular" song in this dataset tends to look like. The individual strength of these two effects should be interpreted with caution, however, as energy and loudness are moderately correlated in the data.
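The collinearity caveat is easy to verify: with the real data you would run `df_modeling[['energy', 'loudness']].corr()`. The same check on synthetic data with a deliberately induced moderate correlation (the data below is simulated, not the Spotify columns):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 5_000
energy = rng.normal(size=n)
# construct loudness to be moderately correlated with energy (r ~ 0.7)
loudness = 0.7 * energy + rng.normal(scale=0.7, size=n)

corr = pd.DataFrame({'energy': energy, 'loudness': loudness}).corr()
r = corr.loc['energy', 'loudness']
print(f"Pearson r = {r:.2f}")
```

When two predictors are correlated at this level, their individual coefficients partly trade off against each other, which is the reason for the caution above.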
This model shows a clear, though modest, improvement in performance over previous models. The ROC AUC increased to around 0.66, indicating better overall ability to distinguish between classes.
The most significant improvement is in the model's ability to find positive cases. The sensitivity increased to 0.085 and the F1 Score rose to 0.14. While still low, their values are higher than in Model 3. At the same time, this improvement comes with a trade-off, as specificity dropped to around 0.95, which means the model makes slightly more false positive errors to find more true positives. The accuracy (around 0.67) remains a misleading metric due to the class imbalance.
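The sensitivity/specificity trade-off above is driven by the fixed 0.5 cut-off, which is a demanding threshold when only a third of songs are popular. Lowering the threshold (for example toward the positive-class rate) trades specificity for sensitivity. A sketch on synthetic, weakly separated scores (simulated, not the fitted model's probabilities):

```python
import numpy as np

rng = np.random.default_rng(2)
y = (rng.random(20_000) < 0.33).astype(int)
# weakly separated scores, broadly like the models here (AUC well below 1)
scores = np.clip(0.28 + 0.12 * y + rng.normal(scale=0.10, size=y.size), 0.0, 1.0)

results = {}
for thr in (0.5, 0.33):
    pred = (scores > thr).astype(int)
    sensitivity = pred[y == 1].mean()
    specificity = 1 - pred[y == 0].mean()
    results[thr] = (sensitivity, specificity)
    print(f"threshold={thr}: sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```

The same `threshold` argument of `fit_and_analyze_logistic` could be varied to explore this trade-off on the actual models.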
Model 5. Continuous inputs with linear main effect and pair-wise interactions¶
formula_list[4]
'binary_outcome ~ (danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms)**2'
fit_glm_5 = smf.logit(formula=formula_list[4], data=df_modeling).fit()
Optimization terminated successfully.
Current function value: 0.604714
Iterations 6
df_modeling_copy['pred_probability_M5'] = fit_glm_5.predict(df_modeling)
df_modeling_copy['pred_class_M5'] = np.where(df_modeling_copy.pred_probability_M5 > 0.5, 1, 0)
fig, ax = plt.subplots()
sns.heatmap(pd.crosstab(df_modeling_copy.binary_outcome, df_modeling_copy.pred_class_M5, margins=True),
annot=True, annot_kws={'size': 20}, fmt='3d')
plt.show()
model_5_results = fit_and_analyze_logistic(mod_name='Model 5', a_formula=formula_list[4], train_data=df_modeling, threshold=0.5)
Optimization terminated successfully.
Current function value: 0.604714
Iterations 6
model_5_results
| model_name | model_formula | num_coefs | num_sign_coefs | sign_coefs_&_their_values | top_2_features | Accuracy | Sensitivity | Precision | Specificity | FPR | F1 Score | ROC_AUC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Model 5 | binary_outcome ~ (danceability + energy + loud... | 56 | 27 | {'Intercept': -0.7532051680487379, 'energy': -... | [Intercept, energy, loudness] | 0.675347 | 0.065375 | 0.541082 | 0.972947 | 0.027053 | 0.116656 | 0.637941 |
This model adds significant complexity by including all possible two-way interactions between the continuous features. It has 56 coefficients, with 27 being statistically significant. Below is the full list of all statistically significant features and their values:
model_5_results['sign_coefs_&_their_values'].iloc[0]
{'Intercept': -0.7532051680487379,
'energy': -0.3106834076116351,
'loudness': 0.2793634428566775,
'speechiness': -0.09220691713082607,
'acousticness': 0.09373444081455799,
'instrumentalness': -0.2620295078594359,
'liveness': -0.033590554154820436,
'valence': 0.03973166789264538,
'tempo': 0.051175350580958866,
'duration_ms': -0.14365658029877337,
'danceability:speechiness': 0.05272747462445708,
'danceability:acousticness': 0.07878215297136672,
'danceability:instrumentalness': -0.06562398956563534,
'danceability:liveness': 0.035630246008381584,
'danceability:duration_ms': -0.06043953932417578,
'energy:loudness': -0.1234989177067231,
'energy:speechiness': -0.05519997122774107,
'energy:valence': 0.04692717948986716,
'loudness:acousticness': -0.06783866101599377,
'loudness:instrumentalness': -0.10289914832890394,
'loudness:valence': 0.06118481775443594,
'speechiness:valence': -0.050890582008801595,
'speechiness:tempo': 0.03383822265464645,
'speechiness:duration_ms': -0.033908687164200556,
'acousticness:valence': -0.05890538265405652,
'acousticness:tempo': 0.036702183815766,
'acousticness:duration_ms': 0.04642356234542932}
The two statistically significant features with the highest magnitude are energy (around -0.31) and loudness (around 0.28). The key new finding is the presence of numerous significant interaction terms. For example, the significant negative interaction between energy and loudness suggests the relationship is not simple: the positive effect of a song being loud may be reduced if the song is also very energetic. As with Models 3 and 4, the individual effects of energy and loudness should be interpreted with caution, as they are moderately correlated.
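That interaction reading can be made explicit. Assuming the continuous inputs were standardized before fitting (which the coefficient scales suggest, e.g. duration_ms near -0.14), the log-odds slope of loudness is the main effect plus the interaction times energy:

```python
# rounded coefficients from the significant-coefficient list above
b_loudness, b_energy_loudness = 0.279, -0.123

# the marginal log-odds slope of loudness at different energy levels
for energy_z in (-1.0, 0.0, 1.0):  # energy in standard deviations
    slope = b_loudness + b_energy_loudness * energy_z
    print(f"energy z={energy_z:+.0f}: loudness log-odds slope = {slope:+.3f}")
```

At one standard deviation above average energy, the loudness effect shrinks from about 0.28 to about 0.16, roughly half its size at average energy.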
Despite the added complexity, this model's performance did not improve over the simpler Model 4. The ROC AUC decreased slightly to around 0.64, and the F1 Score also fell to around 0.12. While precision slightly increased to 0.54, sensitivity dropped to around 0.065, meaning the model is now even worse at identifying popular songs.
Model 6. Interaction between categorical and continuous inputs (including main effects)¶
formula_list[5]
'binary_outcome ~ (playlist_genre + key + mode) * (danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms)'
fit_glm_6 = smf.logit(formula=formula_list[5], data=df_modeling).fit()
Optimization terminated successfully.
Current function value: 0.588672
Iterations 6
df_modeling_copy['pred_probability_M6'] = fit_glm_6.predict(df_modeling)
df_modeling_copy['pred_class_M6'] = np.where(df_modeling_copy.pred_probability_M6 > 0.5, 1, 0)
fig, ax = plt.subplots()
sns.heatmap(pd.crosstab(df_modeling_copy.binary_outcome, df_modeling_copy.pred_class_M6, margins=True),
annot=True, annot_kws={'size': 20}, fmt='3d')
plt.show()
model_6_results = fit_and_analyze_logistic(mod_name='Model 6', a_formula=formula_list[5], train_data=df_modeling, threshold=0.5)
Optimization terminated successfully.
Current function value: 0.588672
Iterations 6
model_6_results
| model_name | model_formula | num_coefs | num_sign_coefs | sign_coefs_&_their_values | top_2_features | Accuracy | Sensitivity | Precision | Specificity | FPR | F1 Score | ROC_AUC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Model 6 | binary_outcome ~ (playlist_genre + key + mode)... | 198 | 36 | {'Intercept': -1.4692198939587524, 'playlist_g... | [Intercept, playlist_genre[T.rock], playlist_g... | 0.68162 | 0.159927 | 0.549958 | 0.936149 | 0.063851 | 0.247796 | 0.678517 |
Model 6, which includes interaction between categorical and continuous inputs (including main effects), has 198 coefficients, with 36 of them being statistically significant. Below is the full list of all statistically significant features and their values:
model_6_results['sign_coefs_&_their_values'].iloc[0]
{'Intercept': -1.4692198939587524,
'playlist_genre[T.latin]': 0.7129044076828355,
'playlist_genre[T.pop]': 0.9697884006457378,
'playlist_genre[T.r&b]': 0.5410788520732197,
'playlist_genre[T.rap]': 0.683108840753402,
'playlist_genre[T.rock]': 1.0648959049386686,
'playlist_genre[T.pop]:danceability': 0.16260997561869103,
'playlist_genre[T.rap]:danceability': 0.32166916627430203,
'energy': -0.30098777260781245,
'key[T.5]:energy': 0.28262384287694486,
'key[T.10]:energy': 0.2675946672055558,
'loudness': 0.22268063960026133,
'playlist_genre[T.latin]:loudness': 0.308053273645286,
'playlist_genre[T.pop]:loudness': 0.2308701655335041,
'playlist_genre[T.r&b]:loudness': 0.31156778841244914,
'key[T.5]:loudness': -0.2276773532572789,
'playlist_genre[T.pop]:speechiness': 0.141346409361902,
'key[T.4]:speechiness': -0.14989435832592057,
'key[T.5]:speechiness': -0.15820907196729422,
'key[T.7]:speechiness': -0.13415630717420718,
'acousticness': 0.17087436844976686,
'playlist_genre[T.pop]:acousticness': -0.1691862440953791,
'playlist_genre[T.r&b]:acousticness': -0.15272166298122875,
'playlist_genre[T.rock]:acousticness': -0.29354175255229853,
'instrumentalness': -0.19774618445032627,
'key[T.11]:instrumentalness': 0.16289813001903056,
'playlist_genre[T.rock]:liveness': -0.17882547702557045,
'mode[T.1]:liveness': 0.08418855173409909,
'playlist_genre[T.r&b]:valence': -0.3002306480191845,
'playlist_genre[T.rap]:valence': -0.21132225970349322,
'duration_ms': -0.13982442258401978,
'playlist_genre[T.latin]:duration_ms': 0.24701972934680702,
'playlist_genre[T.rock]:duration_ms': 0.2239857676255843,
'key[T.8]:duration_ms': -0.17156041337994027,
'key[T.9]:duration_ms': -0.17927816063665708,
'key[T.11]:duration_ms': -0.16265612640241334}
model_6_results.top_2_features[0]
['Intercept', 'playlist_genre[T.rock]', 'playlist_genre[T.pop]']
The two statistically significant features with the highest magnitude are playlist_genre[T.rock] (around 1.07) and playlist_genre[T.pop] (around 0.97). This model also contains many significant interaction terms, which suggests that the effect of an audio feature (such as danceability) on a track's popularity is not universal: it can change depending on the song's genre and musical key.
This model shows better performance than Models 1-5. The ROC AUC improved to 0.68, and the F1 Score saw a substantial increase to 0.25, indicating a much better balance between precision and sensitivity. With sensitivity improving to around 0.16, this model is better at identifying popular songs. At the same time, specificity decreased to around 0.94, so the model makes more false positive errors in order to capture more true positives.
Model 7. Additional model #1: Interaction between playlist_genre and a targeted subset of continuous features¶
formula_list[6]
'binary_outcome ~ playlist_genre * (danceability + energy + speechiness + instrumentalness + valence + tempo)'
fit_glm_7 = smf.logit(formula=formula_list[6], data=df_modeling).fit()
Optimization terminated successfully.
Current function value: 0.603147
Iterations 6
df_modeling_copy['pred_probability_M7'] = fit_glm_7.predict(df_modeling)
df_modeling_copy['pred_class_M7'] = np.where(df_modeling_copy.pred_probability_M7 > 0.5, 1, 0)
fig, ax = plt.subplots()
sns.heatmap(pd.crosstab(df_modeling_copy.binary_outcome, df_modeling_copy.pred_class_M7, margins=True),
annot=True, annot_kws={'size': 20}, fmt='3d')
plt.show()
model_7_results = fit_and_analyze_logistic(mod_name='Model 7', a_formula=formula_list[6], train_data=df_modeling, threshold=0.5)
Optimization terminated successfully.
Current function value: 0.603147
Iterations 6
model_7_results
| model_name | model_formula | num_coefs | num_sign_coefs | sign_coefs_&_their_values | top_2_features | Accuracy | Sensitivity | Precision | Specificity | FPR | F1 Score | ROC_AUC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Model 7 | binary_outcome ~ playlist_genre * (danceabilit... | 42 | 21 | {'Intercept': -1.3781378385906253, 'playlist_g... | [Intercept, playlist_genre[T.pop], playlist_ge... | 0.671338 | 0.033172 | 0.483245 | 0.982693 | 0.017307 | 0.062082 | 0.642028 |
Model 7, which includes interaction between playlist_genre and six continuous input variables, has 42 coefficients, half of which are statistically significant. Below is the full list of all statistically significant features and their values:
model_7_results['sign_coefs_&_their_values'].iloc[0]
{'Intercept': -1.3781378385906253,
'playlist_genre[T.latin]': 0.7102180337256131,
'playlist_genre[T.pop]': 0.9740358514667813,
'playlist_genre[T.r&b]': 0.4004436020311392,
'playlist_genre[T.rap]': 0.7063598154232721,
'playlist_genre[T.rock]': 0.9671501258864602,
'playlist_genre[T.pop]:danceability': 0.20939032310783862,
'playlist_genre[T.rap]:danceability': 0.3628786991383162,
'playlist_genre[T.rock]:danceability': 0.16626109811198622,
'energy': -0.12705703967201093,
'playlist_genre[T.pop]:speechiness': 0.14540488968728038,
'instrumentalness': -0.29720189685181725,
'playlist_genre[T.pop]:instrumentalness': -0.11078544361206202,
'playlist_genre[T.rap]:instrumentalness': 0.2842712561135303,
'playlist_genre[T.rock]:instrumentalness': 0.16163091120174888,
'valence': 0.13730002794055163,
'playlist_genre[T.latin]:valence': -0.1417282127383832,
'playlist_genre[T.pop]:valence': -0.14974500479916417,
'playlist_genre[T.r&b]:valence': -0.39258136414365397,
'playlist_genre[T.rap]:valence': -0.2682237316365914,
'playlist_genre[T.rock]:valence': -0.16352402014638404}
model_7_results.top_2_features[0]
['Intercept', 'playlist_genre[T.pop]', 'playlist_genre[T.rock]']
The main effects for playlist_genre[T.pop] (around 0.97) and playlist_genre[T.rock] (around 0.97) remain the features with the highest magnitude. The significant interaction terms reveal more specific relationships; for example, the positive coefficient for playlist_genre[T.rap]:danceability suggests that danceability has a stronger positive effect on popularity for rap songs compared to the edm genre (reference group).
This simpler model represents a significant decrease in performance from the more comprehensive interaction model (Model 6). The ROC AUC dropped to 0.64, and the F1 Score fell sharply to 0.06. This is driven by the sensitivity returning to a low level of 0.03, indicating the model has poor ability to identify popular songs. This suggests that the features and interactions removed from this model were important for its predictive power.
Model 8. Additional model #2: cyclical (sine) tempo feature interacting with playlist_genre, plus main effects of other key audio features¶
formula_list[7]
'binary_outcome ~ playlist_genre * np.sin(tempo) + danceability + energy + acousticness'
fit_glm_8 = smf.logit(formula=formula_list[7], data=df_modeling).fit()
Optimization terminated successfully.
Current function value: 0.611414
Iterations 6
df_modeling_copy['pred_probability_M8'] = fit_glm_8.predict(df_modeling)
df_modeling_copy['pred_class_M8'] = np.where(df_modeling_copy.pred_probability_M8 > 0.5, 1, 0)
fig, ax = plt.subplots()
sns.heatmap(pd.crosstab(df_modeling_copy.binary_outcome, df_modeling_copy.pred_class_M8, margins=True),
annot=True, annot_kws={'size': 20}, fmt='3d')
plt.show()
model_8_results = fit_and_analyze_logistic(mod_name='Model 8', a_formula=formula_list[7], train_data=df_modeling, threshold=0.5)
Optimization terminated successfully.
Current function value: 0.611414
Iterations 6
model_8_results
| model_name | model_formula | num_coefs | num_sign_coefs | sign_coefs_&_their_values | top_2_features | Accuracy | Sensitivity | Precision | Specificity | FPR | F1 Score | ROC_AUC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Model 8 | binary_outcome ~ playlist_genre * np.sin(tempo... | 15 | 9 | {'Intercept': -1.5459646472786424, 'playlist_g... | [Intercept, playlist_genre[T.rock], playlist_g... | 0.671616 | 0.007143 | 0.453846 | 0.995806 | 0.004194 | 0.014064 | 0.618848 |
Model 8 was designed to test for a cyclical relationship between a song's tempo and its popularity. It has 15 coefficients, 9 of which are statistically significant. Below is the full list of all statistically significant features and their values:
model_8_results['sign_coefs_&_their_values'].iloc[0]
{'Intercept': -1.5459646472786424,
'playlist_genre[T.latin]': 0.9230082780788055,
'playlist_genre[T.pop]': 1.153610253068739,
'playlist_genre[T.r&b]': 0.653154735102794,
'playlist_genre[T.rap]': 0.9299190469592341,
'playlist_genre[T.rock]': 1.2374639439449069,
'danceability': 0.09675689671802463,
'energy': -0.06217633612842594,
'acousticness': 0.13552851281491385}
model_8_results.top_2_features[0]
['Intercept', 'playlist_genre[T.rock]', 'playlist_genre[T.pop]']
The key finding is that this model's central hypothesis was not supported by the data, as the cyclical tempo feature and its interactions were not statistically significant. Instead, the two most impactful predictors were the main effects for the genres, with rock (around 1.24) and pop (around 1.15) again showing the strongest positive association with popularity.
The performance of this model is worse than other interaction models. The ROC AUC of 0.62 is low, and the model almost completely fails to identify popular songs, as shown by the near-zero sensitivity (around 0.007) and F1 Score (0.014). The low performance of the model confirms that this exploratory approach was not successful.
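One likely reason the cyclical hypothesis failed is that `np.sin(tempo)` on a (presumably standardized) tempo imposes an arbitrary period of 2π standard deviations, which has no musical meaning. A more deliberate cyclical encoding would pick a period in the feature's own units and include both sine and cosine terms so the phase is learnable. A sketch with a hypothetical 60-BPM cycle (the tempi and period here are illustrative assumptions, not derived from the data):

```python
import numpy as np
import pandas as pd

tempo_bpm = pd.Series([100.0, 115.0, 145.0, 160.0])  # example tempi in BPM
period = 60.0                                        # hypothetical cycle length

encoded = pd.DataFrame({
    'tempo_sin': np.sin(2 * np.pi * tempo_bpm / period),
    'tempo_cos': np.cos(2 * np.pi * tempo_bpm / period),
})
print(encoded.round(3))
```

Tempi that differ by exactly one period (100 and 160 BPM here) receive identical encodings, which is the behavior a cyclical feature is supposed to have.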
Comparing the models' performance on the training set¶
results_list = []
for m in range(len(formula_list)):
results_list.append( fit_and_analyze_logistic(m+1, formula_list[m], train_data = df_modeling, threshold = 0.5) )
Optimization terminated successfully.
Current function value: 0.632687
Iterations 4
Optimization terminated successfully.
Current function value: 0.615117
Iterations 6
Optimization terminated successfully.
Current function value: 0.613043
Iterations 5
Optimization terminated successfully.
Current function value: 0.598450
Iterations 6
Optimization terminated successfully.
Current function value: 0.604714
Iterations 6
Optimization terminated successfully.
Current function value: 0.588672
Iterations 6
Optimization terminated successfully.
Current function value: 0.603147
Iterations 6
Optimization terminated successfully.
Current function value: 0.611414
Iterations 6
results_df = pd.concat(results_list, ignore_index=True)
results_df.sort_values(by=['Accuracy'], ascending=False)
| model_name | model_formula | num_coefs | num_sign_coefs | sign_coefs_&_their_values | top_2_features | Accuracy | Sensitivity | Precision | Specificity | FPR | F1 Score | ROC_AUC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 6 | binary_outcome ~ (playlist_genre + key + mode)... | 198 | 36 | {'Intercept': -1.4692198939587524, 'playlist_g... | [Intercept, playlist_genre[T.rock], playlist_g... | 0.681620 | 0.159927 | 0.549958 | 0.936149 | 0.063851 | 0.247796 | 0.678517 |
| 4 | 5 | binary_outcome ~ (danceability + energy + loud... | 56 | 27 | {'Intercept': -0.7532051680487379, 'energy': -... | [Intercept, energy, loudness] | 0.675347 | 0.065375 | 0.541082 | 0.972947 | 0.027053 | 0.116656 | 0.637941 |
| 0 | 1 | binary_outcome ~ 1 | 1 | 1 | {'Intercept': -0.717662608612157} | [Intercept] | 0.672092 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.500000 |
| 1 | 2 | binary_outcome ~ playlist_genre + key + mode | 18 | 7 | {'Intercept': -1.633228568540369, 'playlist_ge... | [Intercept, playlist_genre[T.pop], playlist_ge... | 0.672092 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.599069 |
| 7 | 8 | binary_outcome ~ playlist_genre * np.sin(tempo... | 15 | 9 | {'Intercept': -1.5459646472786424, 'playlist_g... | [Intercept, playlist_genre[T.rock], playlist_g... | 0.671616 | 0.007143 | 0.453846 | 0.995806 | 0.004194 | 0.014064 | 0.618848 |
| 6 | 7 | binary_outcome ~ playlist_genre * (danceabilit... | 42 | 21 | {'Intercept': -1.3781378385906253, 'playlist_g... | [Intercept, playlist_genre[T.pop], playlist_ge... | 0.671338 | 0.033172 | 0.483245 | 0.982693 | 0.017307 | 0.062082 | 0.642028 |
| 2 | 3 | binary_outcome ~ danceability + energy + loudn... | 11 | 10 | {'Intercept': -0.7548636863231273, 'energy': -... | [Intercept, energy, loudness] | 0.670822 | 0.027966 | 0.467611 | 0.984465 | 0.015535 | 0.052776 | 0.617639 |
| 3 | 4 | binary_outcome ~ danceability + energy + loudn... | 28 | 15 | {'Intercept': -1.5917205139340989, 'playlist_g... | [Intercept, playlist_genre[T.rock], playlist_g... | 0.669075 | 0.084988 | 0.474324 | 0.954046 | 0.045954 | 0.144148 | 0.656647 |
The most complex model, Model 6, is also the most accurate. However, due to the significant class imbalance (67% of songs are unpopular), accuracy can be misleading, so ROC AUC is the most reliable metric for this dataset and I will focus on it.
results_df.sort_values(by=['ROC_AUC'], ascending=False)
| model_name | model_formula | num_coefs | num_sign_coefs | sign_coefs_&_their_values | top_2_features | Accuracy | Sensitivity | Precision | Specificity | FPR | F1 Score | ROC_AUC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 6 | binary_outcome ~ (playlist_genre + key + mode)... | 198 | 36 | {'Intercept': -1.4692198939587524, 'playlist_g... | [Intercept, playlist_genre[T.rock], playlist_g... | 0.681620 | 0.159927 | 0.549958 | 0.936149 | 0.063851 | 0.247796 | 0.678517 |
| 3 | 4 | binary_outcome ~ danceability + energy + loudn... | 28 | 15 | {'Intercept': -1.5917205139340989, 'playlist_g... | [Intercept, playlist_genre[T.rock], playlist_g... | 0.669075 | 0.084988 | 0.474324 | 0.954046 | 0.045954 | 0.144148 | 0.656647 |
| 6 | 7 | binary_outcome ~ playlist_genre * (danceabilit... | 42 | 21 | {'Intercept': -1.3781378385906253, 'playlist_g... | [Intercept, playlist_genre[T.pop], playlist_ge... | 0.671338 | 0.033172 | 0.483245 | 0.982693 | 0.017307 | 0.062082 | 0.642028 |
| 4 | 5 | binary_outcome ~ (danceability + energy + loud... | 56 | 27 | {'Intercept': -0.7532051680487379, 'energy': -... | [Intercept, energy, loudness] | 0.675347 | 0.065375 | 0.541082 | 0.972947 | 0.027053 | 0.116656 | 0.637941 |
| 7 | 8 | binary_outcome ~ playlist_genre * np.sin(tempo... | 15 | 9 | {'Intercept': -1.5459646472786424, 'playlist_g... | [Intercept, playlist_genre[T.rock], playlist_g... | 0.671616 | 0.007143 | 0.453846 | 0.995806 | 0.004194 | 0.014064 | 0.618848 |
| 2 | 3 | binary_outcome ~ danceability + energy + loudn... | 11 | 10 | {'Intercept': -0.7548636863231273, 'energy': -... | [Intercept, energy, loudness] | 0.670822 | 0.027966 | 0.467611 | 0.984465 | 0.015535 | 0.052776 | 0.617639 |
| 1 | 2 | binary_outcome ~ playlist_genre + key + mode | 18 | 7 | {'Intercept': -1.633228568540369, 'playlist_ge... | [Intercept, playlist_genre[T.pop], playlist_ge... | 0.672092 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.599069 |
| 0 | 1 | binary_outcome ~ 1 | 1 | 1 | {'Intercept': -0.717662608612157} | [Intercept] | 0.672092 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.500000 |
def fit_logistic_make_roc(mod_name, a_formula, train_data):
a_mod = smf.logit(formula=a_formula, data=train_data).fit()
train_copy = train_data.copy()
train_copy['pred_probability'] = a_mod.predict(train_data)
fpr, tpr, threshold = roc_curve(train_data.binary_outcome.to_numpy(), train_copy.pred_probability.to_numpy())
res_df = pd.DataFrame({'tpr': tpr,
'fpr': fpr,
'threshold': threshold})
res_df['model_name'] = mod_name
res_df['model_formula'] = a_formula
return res_df
roc_list = []
for m in range( len(formula_list) ):
roc_list.append( fit_logistic_make_roc(m+1, formula_list[m], train_data=df_modeling) )
Optimization terminated successfully.
Current function value: 0.632687
Iterations 4
Optimization terminated successfully.
Current function value: 0.615117
Iterations 6
Optimization terminated successfully.
Current function value: 0.613043
Iterations 5
Optimization terminated successfully.
Current function value: 0.598450
Iterations 6
Optimization terminated successfully.
Current function value: 0.604714
Iterations 6
Optimization terminated successfully.
Current function value: 0.588672
Iterations 6
Optimization terminated successfully.
Current function value: 0.603147
Iterations 6
Optimization terminated successfully.
Current function value: 0.611414
Iterations 6
roc_df = pd.concat(roc_list, ignore_index=True)
roc_df['model_name'] = roc_df.model_name.astype('category')
roc_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 64115 entries, 0 to 64114 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 tpr 64115 non-null float64 1 fpr 64115 non-null float64 2 threshold 64115 non-null float64 3 model_name 64115 non-null category 4 model_formula 64115 non-null object dtypes: category(1), float64(3), object(1) memory usage: 2.0+ MB
sns.relplot(data=roc_df, x='fpr', y='tpr', hue='model_name',
kind='line', estimator=None, units='model_name',
col='model_name', col_wrap=4)
plt.show()
The ROC curve for the intercept-only Model 1 is a 45-degree diagonal line, which is the expected result for a baseline model with no discriminative ability. Because it assigns the same probability to every song, it cannot distinguish between the positive and negative classes, resulting in an AUC of 0.5.
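This can be verified directly with the rank-based (Mann-Whitney) formulation of AUC: when every song receives the same score, every positive-negative pair is a tie worth 0.5, so the AUC is exactly 0.5. A small self-contained check on simulated labels:

```python
import numpy as np

rng = np.random.default_rng(3)
y = (rng.random(1_000) < 0.33).astype(int)
scores = np.full(y.size, 0.328)  # every song gets the same predicted probability

# Rank-based AUC with tie correction: wins count 1, ties count 0.5
pos, neg = scores[y == 1], scores[y == 0]
wins = (pos[:, None] > neg[None, :]).sum()
ties = (pos[:, None] == neg[None, :]).sum()
auc = (wins + 0.5 * ties) / (pos.size * neg.size)
print(auc)  # 0.5
```

Since all pairs are ties, `wins` is 0 and the AUC collapses to exactly 0.5, matching the diagonal ROC curve.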
In stark contrast, Model 6, the most complex model with 198 coefficients, has the curve pushed furthest toward the top-left corner of the plot. This visually confirms it has the best performance on the training data, achieving the highest True Positive Rate at any given False Positive Rate, and thus the highest ROC AUC score. The other models' curves fall between these two extremes, generally showing improved performance as more features and complexity are added.
Which model has the best performance on the training set?¶
Based on the ROC AUC, Model 6 (the most complex model with the highest number of coefficients) has the best performance on the training set with a score of around 0.68. This indicates it has the strongest ability to distinguish between popular and unpopular songs out of all models tested.
Is the best model different when considering Accuracy vs ROC AUC?¶
No, in this case, the best model is the same for both metrics. Model 6 also has the highest accuracy. However, ROC AUC is the better measure of performance here, as the high accuracy score is inflated by the model's ability to correctly guess the majority "unpopular" class.
Is the best model better than the INTERCEPT-ONLY model?¶
Yes, the best model is significantly better. Model 6's ROC AUC of roughly 0.68 demonstrates a clear improvement in predictive power over the intercept-only model's score of 0.5, which is equivalent to random chance.
How many coefficients are associated with the BEST model?¶
The best-performing model, Model 6, is the most complicated one and has 198 coefficients.
The model with all inputs and linear additive features is Model 4, and the model that performed best on the training set is Model 6. I will use both for predictions on the new data.
model_4_results.model_formula[0]
'binary_outcome ~ danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms + playlist_genre + key + mode'
model_6_results.model_formula[0]
'binary_outcome ~ (playlist_genre + key + mode) * (danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms)'
model_4_results['sign_coefs_&_their_values'].iloc[0]
{'Intercept': -1.5917205139340989,
'playlist_genre[T.latin]': 0.8710920712060213,
'playlist_genre[T.pop]': 1.0828670623658054,
'playlist_genre[T.r&b]': 0.6345725938683296,
'playlist_genre[T.rap]': 0.8857224431014801,
'playlist_genre[T.rock]': 1.4025550042702633,
'danceability': 0.11759980055645763,
'energy': -0.2951494814511126,
'loudness': 0.33056018693286277,
'speechiness': -0.047355152773171635,
'acousticness': 0.10518289624176624,
'instrumentalness': -0.1609632391131655,
'valence': -0.0459921965724886,
'tempo': 0.08061844635555175,
'duration_ms': -0.16591935884430412}
model_6_results['sign_coefs_&_their_values'].iloc[0]
{'Intercept': -1.4692198939587524,
'playlist_genre[T.latin]': 0.7129044076828355,
'playlist_genre[T.pop]': 0.9697884006457378,
'playlist_genre[T.r&b]': 0.5410788520732197,
'playlist_genre[T.rap]': 0.683108840753402,
'playlist_genre[T.rock]': 1.0648959049386686,
'playlist_genre[T.pop]:danceability': 0.16260997561869103,
'playlist_genre[T.rap]:danceability': 0.32166916627430203,
'energy': -0.30098777260781245,
'key[T.5]:energy': 0.28262384287694486,
'key[T.10]:energy': 0.2675946672055558,
'loudness': 0.22268063960026133,
'playlist_genre[T.latin]:loudness': 0.308053273645286,
'playlist_genre[T.pop]:loudness': 0.2308701655335041,
'playlist_genre[T.r&b]:loudness': 0.31156778841244914,
'key[T.5]:loudness': -0.2276773532572789,
'playlist_genre[T.pop]:speechiness': 0.141346409361902,
'key[T.4]:speechiness': -0.14989435832592057,
'key[T.5]:speechiness': -0.15820907196729422,
'key[T.7]:speechiness': -0.13415630717420718,
'acousticness': 0.17087436844976686,
'playlist_genre[T.pop]:acousticness': -0.1691862440953791,
'playlist_genre[T.r&b]:acousticness': -0.15272166298122875,
'playlist_genre[T.rock]:acousticness': -0.29354175255229853,
'instrumentalness': -0.19774618445032627,
'key[T.11]:instrumentalness': 0.16289813001903056,
'playlist_genre[T.rock]:liveness': -0.17882547702557045,
'mode[T.1]:liveness': 0.08418855173409909,
'playlist_genre[T.r&b]:valence': -0.3002306480191845,
'playlist_genre[T.rap]:valence': -0.21132225970349322,
'duration_ms': -0.13982442258401978,
'playlist_genre[T.latin]:duration_ms': 0.24701972934680702,
'playlist_genre[T.rock]:duration_ms': 0.2239857676255843,
'key[T.8]:duration_ms': -0.17156041337994027,
'key[T.9]:duration_ms': -0.17927816063665708,
'key[T.11]:duration_ms': -0.16265612640241334}
In both models, the three most important statistically significant inputs are playlist_genre, energy, and loudness. Comparing the main effects for energy and loudness across Model 4 and Model 6, the loudness coefficient in Model 4 has the single highest magnitude (around 0.33), so loudness was selected as the more impactful of the two continuous features.
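Logistic coefficients live on the log-odds scale, so exponentiating them gives a more interpretable odds ratio. Using Model 4's loudness coefficient from above (and assuming standardized inputs, so the unit is one standard deviation of loudness):

```python
import numpy as np

b_loudness = 0.3306  # Model 4 main-effect coefficient for loudness (rounded)
odds_ratio = np.exp(b_loudness)
print(f"odds ratio per 1 SD of loudness: {odds_ratio:.2f}")
```

An odds ratio around 1.39 means each one-standard-deviation increase in loudness multiplies the odds of a song being popular by roughly 1.4, holding the other inputs fixed.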
input_grid = pd.DataFrame([(danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo,
duration_ms, playlist_genre, key, mode)
for danceability in [df_modeling.danceability.mean()]
for energy in np.linspace(df_modeling.energy.min(), df_modeling.energy.max(), num=5)
for loudness in np.linspace(df_modeling.loudness.min(), df_modeling.loudness.max(), num=101)
for speechiness in [df_modeling.speechiness.mean()]
for acousticness in [df_modeling.acousticness.mean()]
for instrumentalness in [df_modeling.instrumentalness.mean()]
for liveness in [df_modeling.liveness.mean()]
for valence in [df_modeling.valence.mean()]
for tempo in [df_modeling.tempo.mean()]
for duration_ms in [df_modeling.duration_ms.mean()]
for playlist_genre in df_modeling.playlist_genre.unique()
for key in df_modeling['key'].mode()
for mode in df_modeling['mode'].mode()],
columns=['danceability', 'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness',
'valence', 'tempo', 'duration_ms', 'playlist_genre', 'key', 'mode'])
input_grid.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3030 entries, 0 to 3029 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 danceability 3030 non-null float64 1 energy 3030 non-null float64 2 loudness 3030 non-null float64 3 speechiness 3030 non-null float64 4 acousticness 3030 non-null float64 5 instrumentalness 3030 non-null float64 6 liveness 3030 non-null float64 7 valence 3030 non-null float64 8 tempo 3030 non-null float64 9 duration_ms 3030 non-null float64 10 playlist_genre 3030 non-null object 11 key 3030 non-null int64 12 mode 3030 non-null int64 dtypes: float64(10), int64(2), object(1) memory usage: 307.9+ KB
input_grid.nunique()
danceability 1 energy 5 loudness 101 speechiness 1 acousticness 1 instrumentalness 1 liveness 1 valence 1 tempo 1 duration_ms 1 playlist_genre 6 key 1 mode 1 dtype: int64
input_grid.playlist_genre.value_counts()
playlist_genre pop 505 rap 505 rock 505 latin 505 r&b 505 edm 505 Name: count, dtype: int64
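The 3,030 rows follow directly from the size of the Cartesian product: 5 energy values × 101 loudness values × 6 genres = 3,030 combinations, with every other input held at a single value. As a sketch (with stand-in endpoint values rather than the actual df_modeling statistics), the same style of grid can be built with itertools.product:

```python
import itertools
import numpy as np
import pandas as pd

# Stand-in endpoints: in the notebook these come from df_modeling min/max
energies = np.linspace(-3.75, 1.63, num=5)        # 5 values
loudnesses = np.linspace(-6.29, 10.72, num=101)   # 101 values
genres = ['pop', 'rap', 'rock', 'latin', 'r&b', 'edm']  # 6 values

# Every combination of the varied inputs, one row per combination
grid = pd.DataFrame(list(itertools.product(energies, loudnesses, genres)),
                    columns=['energy', 'loudness', 'playlist_genre'])

print(grid.shape)  # (3030, 3): 5 * 101 * 6 combinations
```

The nested-comprehension approach used above produces exactly this product, just with the constant (mean/mode) columns carried along.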
Making predictions on the new data¶
dfviz = input_grid.copy()
dfviz['pred_probability_M4'] = fit_glm_4.predict(input_grid)
dfviz['pred_probability_M6'] = fit_glm_6.predict(input_grid)
dfviz
| danceability | energy | loudness | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | playlist_genre | key | mode | pred_probability_M4 | pred_probability_M6 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.810911e-16 | -3.754985 | -6.293843 | 4.377778e-16 | -1.669874e-16 | 1.015464e-16 | -1.545762e-16 | 1.083162e-16 | 3.430012e-16 | -8.123712e-17 | pop | 1 | 1 | 0.190371 | 0.053369 |
| 1 | 1.810911e-16 | -3.754985 | -6.293843 | 4.377778e-16 | -1.669874e-16 | 1.015464e-16 | -1.545762e-16 | 1.083162e-16 | 3.430012e-16 | -8.123712e-17 | rap | 1 | 1 | 0.161821 | 0.202214 |
| 2 | 1.810911e-16 | -3.754985 | -6.293843 | 4.377778e-16 | -1.669874e-16 | 1.015464e-16 | -1.545762e-16 | 1.083162e-16 | 3.430012e-16 | -8.123712e-17 | rock | 1 | 1 | 0.244547 | 0.168505 |
| 3 | 1.810911e-16 | -3.754985 | -6.293843 | 4.377778e-16 | -1.669874e-16 | 1.015464e-16 | -1.545762e-16 | 1.083162e-16 | 3.430012e-16 | -8.123712e-17 | latin | 1 | 1 | 0.159846 | 0.041464 |
| 4 | 1.810911e-16 | -3.754985 | -6.293843 | 4.377778e-16 | -1.669874e-16 | 1.015464e-16 | -1.545762e-16 | 1.083162e-16 | 3.430012e-16 | -8.123712e-17 | r&b | 1 | 1 | 0.130574 | 0.035059 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3025 | 1.810911e-16 | 1.631404 | 10.720326 | 4.377778e-16 | -1.669874e-16 | 1.015464e-16 | -1.545762e-16 | 1.083162e-16 | 3.430012e-16 | -8.123712e-17 | rap | 1 | 1 | 0.916037 | 0.847017 |
| 3026 | 1.810911e-16 | 1.631404 | 10.720326 | 4.377778e-16 | -1.669874e-16 | 1.015464e-16 | -1.545762e-16 | 1.083162e-16 | 3.430012e-16 | -8.123712e-17 | rock | 1 | 1 | 0.948167 | 0.943913 |
| 3027 | 1.810911e-16 | 1.631404 | 10.720326 | 4.377778e-16 | -1.669874e-16 | 1.015464e-16 | -1.545762e-16 | 1.083162e-16 | 3.430012e-16 | -8.123712e-17 | latin | 1 | 1 | 0.914904 | 0.995504 |
| 3028 | 1.810911e-16 | 1.631404 | 10.720326 | 4.377778e-16 | -1.669874e-16 | 1.015464e-16 | -1.545762e-16 | 1.083162e-16 | 3.430012e-16 | -8.123712e-17 | r&b | 1 | 1 | 0.894592 | 0.994818 |
| 3029 | 1.810911e-16 | 1.631404 | 10.720326 | 4.377778e-16 | -1.669874e-16 | 1.015464e-16 | -1.545762e-16 | 1.083162e-16 | 3.430012e-16 | -8.123712e-17 | edm | 1 | 1 | 0.818163 | 0.835694 |
3030 rows × 15 columns
Visualizing event probability¶
The line plot below shows the probability estimated by Model 4 on the new data (input_grid). The lines are colored by 5 distinct values of energy, and each facet corresponds to a playlist_genre.
sns.relplot(data=dfviz, x='loudness', y='pred_probability_M4', hue='energy', col='playlist_genre',
kind='line', estimator=None, units='energy', palette='icefire',
col_wrap=3)
plt.show()
For each playlist genre, the lines are upward-sloping, meaning that as loudness goes up, the probability of a track being popular increases. We can also see that as energy goes down, the probability of the track being popular increases. The shape of the curve and the spacing between the lines representing different energy levels are generally consistent across genres. This shows that Model 4 assumes the effect of each feature is the same regardless of the song's genre.
model_4_results['sign_coefs_&_their_values'].iloc[0]
{'Intercept': -1.5917205139340989,
'playlist_genre[T.latin]': 0.8710920712060213,
'playlist_genre[T.pop]': 1.0828670623658054,
'playlist_genre[T.r&b]': 0.6345725938683296,
'playlist_genre[T.rap]': 0.8857224431014801,
'playlist_genre[T.rock]': 1.4025550042702633,
'danceability': 0.11759980055645763,
'energy': -0.2951494814511126,
'loudness': 0.33056018693286277,
'speechiness': -0.047355152773171635,
'acousticness': 0.10518289624176624,
'instrumentalness': -0.1609632391131655,
'valence': -0.0459921965724886,
'tempo': 0.08061844635555175,
'duration_ms': -0.16591935884430412}
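The consistency across facets can be traced back to these coefficients: Model 4 is additive on the log-odds scale, so switching genre only shifts the intercept while the loudness slope stays the same for every genre. A minimal sketch using the Model 4 coefficient values printed above (all other standardized inputs held at 0, i.e. their mean; edm is the baseline genre):

```python
import numpy as np

def sigmoid(z):
    # Inverse-logit: maps log-odds to event probability
    return 1.0 / (1.0 + np.exp(-z))

# Model 4 coefficients copied from the output above
intercept = -1.5917205139340989
b_loudness = 0.33056018693286277
genre_shift = {'edm': 0.0, 'rock': 1.4025550042702633}

loudness = np.linspace(-6.3, 10.7, num=101)
for genre, shift in genre_shift.items():
    log_odds = intercept + shift + b_loudness * loudness
    prob = sigmoid(log_odds)
    print(genre, round(prob.min(), 3), round(prob.max(), 3))

# On the log-odds scale the two curves are parallel: the gap between
# genres equals the genre coefficient at every loudness value. The
# sigmoid only bends this into slightly different probability curves.
```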
The plot below shows the event probabilities predicted by Model 6 on the new data. Each line corresponds to a different energy level, and the facets separate the predictions across the six playlist genres.
sns.relplot(data=dfviz, x='loudness', y='pred_probability_M6', hue='energy', col='playlist_genre',
kind='line', estimator=None, units='energy', palette='icefire',
col_wrap=3)
plt.show()
Similar to Model 4, the plots for Model 6 show that higher loudness and lower energy are generally associated with a higher probability of being popular. However, the key difference is that the shapes of these relationships now change for each genre.
The steepness of the lines varies noticeably: they are steepest for pop, indicating that loudness has the strongest positive effect on popularity in that genre, and flattest for rap, where loudness has a much weaker effect.
We also observe larger differences in the spacing between the lines. For instance, the gap is widest for the rap genre and narrowest for pop, showing that the negative effect of energy on popularity is most pronounced for rap and least pronounced for pop.
The main trend is that Model 6 learns how an audio feature's importance changes for each specific genre, as opposed to Model 4's more rigid approach.
In the plots for Model 6, we can see that for some genres, like pop, latin, and r&b, the predicted probabilities get much closer to 1 compared to other genres, like edm. This suggests that the model is much more certain in predictions for those three genres. Model 4, in contrast, doesn't show the same level of variation in certainty. The varied reliability of Model 6 allows us to trust its predictions more for some categories than for others.
model_6_results['sign_coefs_&_their_values'].iloc[0]
{'Intercept': -1.4692198939587524,
'playlist_genre[T.latin]': 0.7129044076828355,
'playlist_genre[T.pop]': 0.9697884006457378,
'playlist_genre[T.r&b]': 0.5410788520732197,
'playlist_genre[T.rap]': 0.683108840753402,
'playlist_genre[T.rock]': 1.0648959049386686,
'playlist_genre[T.pop]:danceability': 0.16260997561869103,
'playlist_genre[T.rap]:danceability': 0.32166916627430203,
'energy': -0.30098777260781245,
'key[T.5]:energy': 0.28262384287694486,
'key[T.10]:energy': 0.2675946672055558,
'loudness': 0.22268063960026133,
'playlist_genre[T.latin]:loudness': 0.308053273645286,
'playlist_genre[T.pop]:loudness': 0.2308701655335041,
'playlist_genre[T.r&b]:loudness': 0.31156778841244914,
'key[T.5]:loudness': -0.2276773532572789,
'playlist_genre[T.pop]:speechiness': 0.141346409361902,
'key[T.4]:speechiness': -0.14989435832592057,
'key[T.5]:speechiness': -0.15820907196729422,
'key[T.7]:speechiness': -0.13415630717420718,
'acousticness': 0.17087436844976686,
'playlist_genre[T.pop]:acousticness': -0.1691862440953791,
'playlist_genre[T.r&b]:acousticness': -0.15272166298122875,
'playlist_genre[T.rock]:acousticness': -0.29354175255229853,
'instrumentalness': -0.19774618445032627,
'key[T.11]:instrumentalness': 0.16289813001903056,
'playlist_genre[T.rock]:liveness': -0.17882547702557045,
'mode[T.1]:liveness': 0.08418855173409909,
'playlist_genre[T.r&b]:valence': -0.3002306480191845,
'playlist_genre[T.rap]:valence': -0.21132225970349322,
'duration_ms': -0.13982442258401978,
'playlist_genre[T.latin]:duration_ms': 0.24701972934680702,
'playlist_genre[T.rock]:duration_ms': 0.2239857676255843,
'key[T.8]:duration_ms': -0.17156041337994027,
'key[T.9]:duration_ms': -0.17927816063665708,
'key[T.11]:duration_ms': -0.16265612640241334}
results_df.sort_values(by=['num_coefs'], ascending=False)
| model_name | model_formula | num_coefs | num_sign_coefs | sign_coefs_&_their_values | top_2_features | Accuracy | Sensitivity | Precision | Specificity | FPR | F1 Score | ROC_AUC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 6 | binary_outcome ~ (playlist_genre + key + mode)... | 198 | 36 | {'Intercept': -1.4692198939587524, 'playlist_g... | [Intercept, playlist_genre[T.rock], playlist_g... | 0.681620 | 0.159927 | 0.549958 | 0.936149 | 0.063851 | 0.247796 | 0.678517 |
| 4 | 5 | binary_outcome ~ (danceability + energy + loud... | 56 | 27 | {'Intercept': -0.7532051680487379, 'energy': -... | [Intercept, energy, loudness] | 0.675347 | 0.065375 | 0.541082 | 0.972947 | 0.027053 | 0.116656 | 0.637941 |
| 6 | 7 | binary_outcome ~ playlist_genre * (danceabilit... | 42 | 21 | {'Intercept': -1.3781378385906253, 'playlist_g... | [Intercept, playlist_genre[T.pop], playlist_ge... | 0.671338 | 0.033172 | 0.483245 | 0.982693 | 0.017307 | 0.062082 | 0.642028 |
| 3 | 4 | binary_outcome ~ danceability + energy + loudn... | 28 | 15 | {'Intercept': -1.5917205139340989, 'playlist_g... | [Intercept, playlist_genre[T.rock], playlist_g... | 0.669075 | 0.084988 | 0.474324 | 0.954046 | 0.045954 | 0.144148 | 0.656647 |
| 1 | 2 | binary_outcome ~ playlist_genre + key + mode | 18 | 7 | {'Intercept': -1.633228568540369, 'playlist_ge... | [Intercept, playlist_genre[T.pop], playlist_ge... | 0.672092 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.599069 |
| 7 | 8 | binary_outcome ~ playlist_genre * np.sin(tempo... | 15 | 9 | {'Intercept': -1.5459646472786424, 'playlist_g... | [Intercept, playlist_genre[T.rock], playlist_g... | 0.671616 | 0.007143 | 0.453846 | 0.995806 | 0.004194 | 0.014064 | 0.618848 |
| 2 | 3 | binary_outcome ~ danceability + energy + loudn... | 11 | 10 | {'Intercept': -0.7548636863231273, 'energy': -... | [Intercept, energy, loudness] | 0.670822 | 0.027966 | 0.467611 | 0.984465 | 0.015535 | 0.052776 | 0.617639 |
| 0 | 1 | binary_outcome ~ 1 | 1 | 1 | {'Intercept': -0.717662608612157} | [Intercept] | 0.672092 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.500000 |
formula_list[5]
'binary_outcome ~ (playlist_genre + key + mode) * (danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms)'
formula_list[2]
'binary_outcome ~ danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms'
formula_list[3]
'binary_outcome ~ danceability + energy + loudness + speechiness + acousticness + instrumentalness + liveness + valence + tempo + duration_ms + playlist_genre + key + mode'
I selected three models for cross-validation:
- Model 6 - the model with the best performance on the training set. It includes interactions between the categorical and continuous inputs (plus their main effects)
- Model 3 - a model that includes only the continuous inputs as linear additive features; it is simple, with just 11 coefficients
- Model 4 - a model that includes all inputs (continuous and categorical) as linear additive features; it is a medium-complexity model with 28 coefficients.
Perform cross-validation and calculate performance metrics¶
from sklearn.model_selection import StratifiedKFold
I will use 5-fold cross-validation.
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=101)
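StratifiedKFold preserves the outcome's class proportions within every fold, which matters here because only about a third of the tracks are popular. A small self-contained sketch on synthetic labels (stratification depends only on y, so the features are placeholders):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced labels mimicking the ~67/33 split in the data
y = np.array([0] * 670 + [1] * 330)
X = np.zeros((len(y), 1))  # placeholder features; stratification uses y only

kf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=101)
for train_id, test_id in kf_demo.split(X, y):
    # Each test fold keeps the same ~33% positive rate as the full data
    print(round(y[test_id].mean(), 2))
```

With a plain (unstratified) KFold, a fold could by chance contain far fewer popular tracks, which would distort fold-level metrics like sensitivity.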
I will fit the models with statsmodels.
In Section D at the preprocessing stage, I created a dataset df_numeric_inputs that includes only continuous input variables. I also performed log and cube root transformations on those variables whose distributions were strongly skewed. Thus, df_numeric_inputs is a dataset which already contains log/cube root transformed numeric input variables.
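To recap what those transformations accomplish, a strongly right-skewed variable is pulled toward symmetry by a log transform, while the cube root also reduces skew and additionally handles zero and negative values. A minimal sketch on synthetic skewed data (the actual columns and transformation choices live in Section D):

```python
import numpy as np

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # right-skewed

def sample_skewness(x):
    # Standardized third moment: 0 for a perfectly symmetric distribution
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3)

print(round(sample_skewness(skewed), 2))           # strongly positive
print(round(sample_skewness(np.log(skewed)), 2))   # near 0 after log
print(round(sample_skewness(np.cbrt(skewed)), 2))  # reduced by cube root
```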
df_copy is a cleaned dataset containing all variables (including those that were not used in the analysis).
I will create the dataset df_CV, a list input_names, and a string output_name; these will be passed as arguments to the cross-validation function to streamline splitting, standardization, and performance-metric calculation.
df_CV = df_numeric_inputs.copy()
df_CV['playlist_genre'] = df_copy.playlist_genre
df_CV['mode'] = df_copy['mode']
df_CV['key'] = df_copy.key
df_CV['binary_outcome'] = df_copy.binary_outcome.astype(int)
input_names = df_CV.drop(columns=['binary_outcome']).columns.tolist()
input_names
['danceability', 'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms', 'playlist_genre', 'mode', 'key']
output_name = 'binary_outcome'
df_CV.info()
<class 'pandas.core.frame.DataFrame'> Index: 25190 entries, 3 to 32832 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 danceability 25190 non-null float64 1 energy 25190 non-null float64 2 loudness 25190 non-null float64 3 speechiness 25190 non-null float64 4 acousticness 25190 non-null float64 5 instrumentalness 25190 non-null float64 6 liveness 25190 non-null float64 7 valence 25190 non-null float64 8 tempo 25190 non-null float64 9 duration_ms 25190 non-null int64 10 playlist_genre 25190 non-null object 11 mode 25190 non-null object 12 key 25190 non-null object 13 binary_outcome 25190 non-null int32 dtypes: float64(9), int32(1), int64(1), object(3) memory usage: 3.3+ MB
The function below automates the process of model training and evaluation by:
- Splitting the dataset into five folds,
- Standardizing continuous variables within each fold,
- Fitting a logistic regression model on the training set,
- Predicting outcomes for the test set, and
- Calculating performance metrics, including accuracy, sensitivity, precision, specificity, FPR, F1 score, and ROC AUC.
def train_test_and_asses_logistic_with_cv(mod_name, a_formula, data_df, x_names, y_name, cv, threshold=0.5):
# Initialize a list for each performance metric to store the results from each fold
accuracy_res = []
sensitivity_res = []
precision_res = []
specificity_res = []
fpr_res = []
f1_res = []
roc_auc_res = []
# Get the names of the continuous variables that need standardizing
continuous_vars = data_df[x_names].select_dtypes(include=np.number).columns.tolist()
# Split the data and iterate over the folds
for train_id, test_id in cv.split( data_df.to_numpy(), data_df[y_name].to_numpy() ):
# subset the training and test splits within each fold
train_data = data_df.iloc[train_id].copy()
test_data = data_df.iloc[test_id].copy()
# Standardize the data within each split
scaler = StandardScaler()
# Fit the scaler only on the training data to avoid data leakage
scaler.fit(train_data[continuous_vars])
# Use the fitted scaler to transform both the training and testing data
train_data[continuous_vars] = scaler.transform(train_data[continuous_vars])
test_data[continuous_vars] = scaler.transform(test_data[continuous_vars])
# Fit the model on the training data within the current fold
a_mod = smf.logit(formula=a_formula, data=train_data).fit()
# Predict the test dataset within each fold
test_copy = test_data.copy()
test_copy['pred_probability'] = a_mod.predict(test_data)
test_copy['pred_class'] = np.where(test_copy.pred_probability > threshold, 1, 0)
# Calculate all performance metrics on the testing set
TN, FP, FN, TP = confusion_matrix(test_copy[y_name].to_numpy(), test_copy.pred_class.to_numpy() ).ravel()
Accuracy = (TN + TP) / (TN + FP + FN + TP)
Sensitivity = TP / (TP + FN)
Precision = TP / (TP + FP) if (TP + FP) > 0 else 0
Specificity = TN / (TN + FP)
FPR = 1 - Specificity
F1_Score = 2 * (Precision * Sensitivity) / (Precision + Sensitivity) if (Precision + Sensitivity) > 0 else 0
ROC_AUC = roc_auc_score(test_copy[y_name].to_numpy(), test_copy.pred_probability.to_numpy())
# Append the results for this fold to our lists
accuracy_res.append(Accuracy)
sensitivity_res.append(Sensitivity)
precision_res.append(Precision)
specificity_res.append(Specificity)
fpr_res.append(FPR)
f1_res.append(F1_Score)
roc_auc_res.append(ROC_AUC)
# Bookkeeping to store the results
results_dict = {'model_name': mod_name,
'model_formula': a_formula,
'num_coefs': len(a_mod.params),
'fold_id': list(range(1, len(accuracy_res) + 1)),
'Accuracy': accuracy_res,
'Sensitivity': sensitivity_res,
'Precision': precision_res,
'Specificity': specificity_res,
'FPR': fpr_res,
'F1 Score': f1_res,
'ROC AUC': roc_auc_res}
# Convert the dictionary to a DataFrame
test_df = pd.DataFrame(results_dict)
return test_df
M3_CV = train_test_and_asses_logistic_with_cv('Model 3', formula_list[2], data_df=df_CV, x_names=input_names, y_name=output_name, cv=kf, threshold=0.5)
Optimization terminated successfully.
Current function value: 0.612025
Iterations 5
Optimization terminated successfully.
Current function value: 0.612084
Iterations 5
Optimization terminated successfully.
Current function value: 0.613683
Iterations 5
Optimization terminated successfully.
Current function value: 0.612960
Iterations 5
Optimization terminated successfully.
Current function value: 0.614287
Iterations 5
M4_CV = train_test_and_asses_logistic_with_cv('Model 4', formula_list[3], data_df=df_CV, x_names=input_names, y_name=output_name, cv=kf, threshold=0.5)
Optimization terminated successfully.
Current function value: 0.597774
Iterations 6
Optimization terminated successfully.
Current function value: 0.596139
Iterations 6
Optimization terminated successfully.
Current function value: 0.599665
Iterations 6
Optimization terminated successfully.
Current function value: 0.598338
Iterations 6
Optimization terminated successfully.
Current function value: 0.599724
Iterations 6
M6_CV = train_test_and_asses_logistic_with_cv('Model 6', formula_list[5], data_df=df_CV, x_names=input_names, y_name=output_name, cv=kf, threshold=0.5)
Optimization terminated successfully.
Current function value: 0.586741
Iterations 6
Optimization terminated successfully.
Current function value: 0.585734
Iterations 6
Optimization terminated successfully.
Current function value: 0.589036
Iterations 6
Optimization terminated successfully.
Current function value: 0.588041
Iterations 6
Optimization terminated successfully.
Current function value: 0.589298
Iterations 6
CV_results = pd.concat([M3_CV, M4_CV, M6_CV], ignore_index=True)
CV_results
| model_name | model_formula | num_coefs | fold_id | Accuracy | Sensitivity | Precision | Specificity | FPR | F1 Score | ROC AUC | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Model 3 | binary_outcome ~ danceability + energy + loudn... | 11 | 1 | 0.672291 | 0.033898 | 0.504505 | 0.983757 | 0.016243 | 0.063528 | 0.606414 |
| 1 | Model 3 | binary_outcome ~ danceability + energy + loudn... | 11 | 2 | 0.667130 | 0.026634 | 0.389381 | 0.979622 | 0.020378 | 0.049858 | 0.605217 |
| 2 | Model 3 | binary_outcome ~ danceability + energy + loudn... | 11 | 3 | 0.673283 | 0.029056 | 0.533333 | 0.987596 | 0.012404 | 0.055109 | 0.621767 |
| 3 | Model 3 | binary_outcome ~ danceability + energy + loudn... | 11 | 4 | 0.670107 | 0.024818 | 0.445652 | 0.984938 | 0.015062 | 0.047018 | 0.617876 |
| 4 | Model 3 | binary_outcome ~ danceability + energy + loudn... | 11 | 5 | 0.672489 | 0.024213 | 0.512821 | 0.988777 | 0.011223 | 0.046243 | 0.634369 |
| 5 | Model 4 | binary_outcome ~ danceability + energy + loudn... | 28 | 1 | 0.666733 | 0.084746 | 0.456026 | 0.950679 | 0.049321 | 0.142930 | 0.649198 |
| 6 | Model 4 | binary_outcome ~ danceability + energy + loudn... | 28 | 2 | 0.659389 | 0.083535 | 0.405882 | 0.940343 | 0.059657 | 0.138554 | 0.638116 |
| 7 | Model 4 | binary_outcome ~ danceability + energy + loudn... | 28 | 3 | 0.675069 | 0.082930 | 0.528958 | 0.963969 | 0.036031 | 0.143380 | 0.664914 |
| 8 | Model 4 | binary_outcome ~ danceability + energy + loudn... | 28 | 4 | 0.670306 | 0.098668 | 0.486567 | 0.949203 | 0.050797 | 0.164066 | 0.655030 |
| 9 | Model 4 | binary_outcome ~ danceability + energy + loudn... | 28 | 5 | 0.669710 | 0.084746 | 0.479452 | 0.955109 | 0.044891 | 0.144033 | 0.663753 |
| 10 | Model 6 | binary_outcome ~ (playlist_genre + key + mode)... | 198 | 1 | 0.666733 | 0.149516 | 0.474088 | 0.919079 | 0.080921 | 0.227335 | 0.652764 |
| 11 | Model 6 | binary_outcome ~ (playlist_genre + key + mode)... | 198 | 2 | 0.666534 | 0.141646 | 0.471774 | 0.922623 | 0.077377 | 0.217877 | 0.650286 |
| 12 | Model 6 | binary_outcome ~ (playlist_genre + key + mode)... | 198 | 3 | 0.677848 | 0.140436 | 0.533333 | 0.940047 | 0.059953 | 0.222329 | 0.671652 |
| 13 | Model 6 | binary_outcome ~ (playlist_genre + key + mode)... | 198 | 4 | 0.676856 | 0.168281 | 0.522556 | 0.924985 | 0.075015 | 0.254579 | 0.664146 |
| 14 | Model 6 | binary_outcome ~ (playlist_genre + key + mode)... | 198 | 5 | 0.674276 | 0.154358 | 0.511022 | 0.927939 | 0.072061 | 0.237099 | 0.673748 |
sns.catplot(data=CV_results, x='model_name', y='Accuracy', kind='point', linestyle='none')
plt.show()
df_CV.binary_outcome.value_counts(normalize=True)
binary_outcome 0 0.672092 1 0.327908 Name: proportion, dtype: float64
While Model 6 has the highest mean value of accuracy, its 95% CI largely overlaps with other models, indicating no statistically significant difference. More importantly, because the dataset is imbalanced (over 67% of songs are unpopular), accuracy is a misleading metric. Therefore, I will focus on ROC AUC, which provides a more reliable measure of a model's ability to distinguish between classes, regardless of the classification threshold.
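The imbalance point can be made concrete: a classifier that always predicts "not popular" reaches about 67% accuracy on this data while learning nothing (this is essentially the intercept-only Model 1 in the results table, whose ROC AUC is 0.5). A minimal sketch:

```python
import numpy as np

# Class balance matching the data: ~67% unpopular (0), ~33% popular (1)
y_true = np.array([0] * 672 + [1] * 328)

# Majority-class baseline: always predict "not popular"
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
print(round(accuracy, 3))  # 0.672 -- looks decent, but is meaningless

# Sensitivity (recall of the positive class) exposes the problem
sensitivity = y_pred[y_true == 1].mean()
print(sensitivity)  # 0.0 -- not a single popular track is found
```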
ROC AUC¶
sns.catplot(data=CV_results, x='model_name', y='ROC AUC', kind='point', linestyle='none')
plt.show()
The cross-validation results show that all three models have an average ROC AUC score significantly above 0.5, indicating they all possess predictive power better than random chance. While Model 6 has the highest mean ROC AUC, its 95% confidence interval has a substantial overlap with that of Model 4. This suggests there is no statistically significant difference in performance between these two models.
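The overlap can also be checked numerically. With five folds, an approximate 95% confidence interval for the mean ROC AUC is mean ± t * sd / sqrt(5), where t ≈ 2.776 is the 97.5% quantile of the t-distribution with 4 degrees of freedom (this is an approximation; seaborn's point plot computes its interval by bootstrapping). Using the per-fold scores from the table above:

```python
import numpy as np

# Per-fold ROC AUC values from CV_results above
roc_auc = {
    'Model 4': [0.649198, 0.638116, 0.664914, 0.655030, 0.663753],
    'Model 6': [0.652764, 0.650286, 0.671652, 0.664146, 0.673748],
}

t_crit = 2.776  # t-distribution 97.5% quantile, 4 degrees of freedom

for name, scores in roc_auc.items():
    scores = np.asarray(scores)
    half_width = t_crit * scores.std(ddof=1) / np.sqrt(len(scores))
    lo, hi = scores.mean() - half_width, scores.mean() + half_width
    print(f'{name}: {scores.mean():.4f} [{lo:.4f}, {hi:.4f}]')
# The two intervals overlap, so the difference is not clearly significant.
```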
CV_results
| model_name | model_formula | num_coefs | fold_id | Accuracy | Sensitivity | Precision | Specificity | FPR | F1 Score | ROC AUC | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Model 3 | binary_outcome ~ danceability + energy + loudn... | 11 | 1 | 0.672291 | 0.033898 | 0.504505 | 0.983757 | 0.016243 | 0.063528 | 0.606414 |
| 1 | Model 3 | binary_outcome ~ danceability + energy + loudn... | 11 | 2 | 0.667130 | 0.026634 | 0.389381 | 0.979622 | 0.020378 | 0.049858 | 0.605217 |
| 2 | Model 3 | binary_outcome ~ danceability + energy + loudn... | 11 | 3 | 0.673283 | 0.029056 | 0.533333 | 0.987596 | 0.012404 | 0.055109 | 0.621767 |
| 3 | Model 3 | binary_outcome ~ danceability + energy + loudn... | 11 | 4 | 0.670107 | 0.024818 | 0.445652 | 0.984938 | 0.015062 | 0.047018 | 0.617876 |
| 4 | Model 3 | binary_outcome ~ danceability + energy + loudn... | 11 | 5 | 0.672489 | 0.024213 | 0.512821 | 0.988777 | 0.011223 | 0.046243 | 0.634369 |
| 5 | Model 4 | binary_outcome ~ danceability + energy + loudn... | 28 | 1 | 0.666733 | 0.084746 | 0.456026 | 0.950679 | 0.049321 | 0.142930 | 0.649198 |
| 6 | Model 4 | binary_outcome ~ danceability + energy + loudn... | 28 | 2 | 0.659389 | 0.083535 | 0.405882 | 0.940343 | 0.059657 | 0.138554 | 0.638116 |
| 7 | Model 4 | binary_outcome ~ danceability + energy + loudn... | 28 | 3 | 0.675069 | 0.082930 | 0.528958 | 0.963969 | 0.036031 | 0.143380 | 0.664914 |
| 8 | Model 4 | binary_outcome ~ danceability + energy + loudn... | 28 | 4 | 0.670306 | 0.098668 | 0.486567 | 0.949203 | 0.050797 | 0.164066 | 0.655030 |
| 9 | Model 4 | binary_outcome ~ danceability + energy + loudn... | 28 | 5 | 0.669710 | 0.084746 | 0.479452 | 0.955109 | 0.044891 | 0.144033 | 0.663753 |
| 10 | Model 6 | binary_outcome ~ (playlist_genre + key + mode)... | 198 | 1 | 0.666733 | 0.149516 | 0.474088 | 0.919079 | 0.080921 | 0.227335 | 0.652764 |
| 11 | Model 6 | binary_outcome ~ (playlist_genre + key + mode)... | 198 | 2 | 0.666534 | 0.141646 | 0.471774 | 0.922623 | 0.077377 | 0.217877 | 0.650286 |
| 12 | Model 6 | binary_outcome ~ (playlist_genre + key + mode)... | 198 | 3 | 0.677848 | 0.140436 | 0.533333 | 0.940047 | 0.059953 | 0.222329 | 0.671652 |
| 13 | Model 6 | binary_outcome ~ (playlist_genre + key + mode)... | 198 | 4 | 0.676856 | 0.168281 | 0.522556 | 0.924985 | 0.075015 | 0.254579 | 0.664146 |
| 14 | Model 6 | binary_outcome ~ (playlist_genre + key + mode)... | 198 | 5 | 0.674276 | 0.154358 | 0.511022 | 0.927939 | 0.072061 | 0.237099 | 0.673748 |
Which model is the BEST according to CROSS-VALIDATION?¶
To determine the best model, I will focus on the ROC AUC score. Given that the dataset is imbalanced (over 67% of songs being unpopular), Accuracy is a misleading metric. The ROC AUC provides a more reliable assessment of a model's ability to discriminate between classes.
The point plot of the cross-validation results shows that Model 6 has the highest average ROC AUC. However, its 95% confidence interval substantially overlaps with that of Model 4. This overlap indicates that there is no statistically significant difference in performance between the two models.
Therefore, following the "simplest best" rule, Model 4 is the best model according to cross-validation. While its performance is similar to that of the much more complex Model 6, it achieves this with only 28 coefficients compared to Model 6's 198. This makes Model 4 less prone to overfitting and easier to interpret.
Is this model DIFFERENT from the model identified as the BEST according to the training set?¶
Yes, the best model according to the cross-validation is Model 4 while the best model according to the training set is Model 6.
How many regression coefficients are associated with the best model?¶
Model 4 has 28 coefficients.
Differences between the best model's performance on the training set vs cross-validation¶
To see how the performance of Model 4 changed from the training set to cross-validation, I will compare its training-set metrics with the average of each metric across the 5 folds.
results_df[results_df.model_name == 4]
| model_name | model_formula | num_coefs | num_sign_coefs | sign_coefs_&_their_values | top_2_features | Accuracy | Sensitivity | Precision | Specificity | FPR | F1 Score | ROC_AUC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 4 | binary_outcome ~ danceability + energy + loudn... | 28 | 15 | {'Intercept': -1.5917205139340989, 'playlist_g... | [Intercept, playlist_genre[T.rock], playlist_g... | 0.669075 | 0.084988 | 0.474324 | 0.954046 | 0.045954 | 0.144148 | 0.656647 |
model_4_results = CV_results[CV_results['model_name'] == 'Model 4']
M4_average_performance_CV = model_4_results[['Accuracy', 'Sensitivity', 'Precision', 'Specificity', 'FPR', 'F1 Score', 'ROC AUC']].mean()
M4_average_performance_CV
Accuracy 0.668241 Sensitivity 0.086925 Precision 0.471377 Specificity 0.951861 FPR 0.048139 F1 Score 0.146593 ROC AUC 0.654202 dtype: float64
The slight decrease in metrics like ROC AUC and Accuracy during cross-validation indicates a minor degree of overfitting. We also see small fluctuations in the other metrics, which is expected. The model performed slightly better on data it had already seen, and the cross-validation score is a more reliable assessment of its predictive ability.
Applying the best model on the entire dataset¶
To see the final set of coefficients for Model 4, which cross-validation identified as the best model, I re-fitted it on the entire preprocessed df_modeling dataset.
best_model = smf.logit(formula=formula_list[3], data=df_modeling).fit()
Optimization terminated successfully.
Current function value: 0.598450
Iterations 6
best_model.params
Intercept -1.591721 playlist_genre[T.latin] 0.871092 playlist_genre[T.pop] 1.082867 playlist_genre[T.r&b] 0.634573 playlist_genre[T.rap] 0.885722 playlist_genre[T.rock] 1.402555 key[T.1] 0.003998 key[T.2] -0.069688 key[T.3] -0.094121 key[T.4] -0.032777 key[T.5] 0.072833 key[T.6] 0.016002 key[T.7] -0.086394 key[T.8] 0.094110 key[T.9] -0.018013 key[T.10] 0.060207 key[T.11] -0.031106 mode[T.1] 0.029470 danceability 0.117600 energy -0.295149 loudness 0.330560 speechiness -0.047355 acousticness 0.105183 instrumentalness -0.160963 liveness -0.021375 valence -0.045992 tempo 0.080618 duration_ms -0.165919 dtype: float64
The statistically significant coefficients, ranked by the magnitude of their effect, are shown below. We see that playlist_genre, loudness, and energy have the biggest effects. However, the effect of playlist_genre is not uniform: it is larger for rock and pop than for the other genres, relative to the edm baseline.
np.abs(best_model.params[best_model.pvalues < 0.05]).sort_values(ascending=False)
Intercept 1.591721 playlist_genre[T.rock] 1.402555 playlist_genre[T.pop] 1.082867 playlist_genre[T.rap] 0.885722 playlist_genre[T.latin] 0.871092 playlist_genre[T.r&b] 0.634573 loudness 0.330560 energy 0.295149 duration_ms 0.165919 instrumentalness 0.160963 danceability 0.117600 acousticness 0.105183 tempo 0.080618 speechiness 0.047355 valence 0.045992 dtype: float64
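Because these are logistic-regression coefficients on standardized inputs, exponentiating converts them to odds ratios: exp(0.3306) ≈ 1.39, so a one-standard-deviation increase in loudness multiplies the odds of a track being popular by roughly 1.39, holding everything else fixed. A quick sketch with a few of the values above:

```python
import numpy as np

# A few significant Model 4 coefficients (log-odds scale) from the output above
coefs = {
    'playlist_genre[T.rock]': 1.402555,
    'loudness': 0.330560,
    'energy': -0.295149,
    'duration_ms': -0.165919,
}

for name, beta in coefs.items():
    # exp(beta) multiplies the odds for a one-unit (here, one-SD) change;
    # for the genre dummies, it is the odds relative to the edm baseline
    print(f'{name}: odds ratio = {np.exp(beta):.3f}')
```

Odds ratios above 1 increase the odds of popularity (rock vs edm, louder tracks); those below 1 decrease them (higher energy, longer duration).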